Re: Lucene 1483 and Auto resolution
Urgh, right. Can't we simply restore the AUTO resolution into it? Existing (direct) usage of it must be passing in the top-level IndexReader. (Lucene doesn't use FSHQ internally.)

Mike

On Thu, Apr 23, 2009 at 6:48 PM, Mark Miller markrmil...@gmail.com wrote:
> Just got off the train, and NY to CT has a brilliant bar car, so lest I forget: LUCENE-1483 moved AUTO resolution from FieldSortedHitQueue (FSHQ) to IndexSearcher - which is a back-compat break if you were using an FSHQ without IndexSearcher (Solr does it - anyone could). Annoying.
>
> If I remember right, I did it to resolve AUTO on the MultiReader rather than on each individual segment reader. So the change is needed and yet not allowed. Perhaps it could just re-resolve like before, though - if IndexSearcher has already resolved, fine; otherwise it will be done again at the FSHQ level. I'll issue it up later.
>
> --
> - Mark
> http://www.lucidimagination.com
I am unable to create index of an object having composite key
Hi all,

I am using Hibernate Search with Lucene. I need to create an index for the DomainTag object, which has a composite key. I am unaware how to define the annotations for the composite key in the DomainTag (POJO) class. If anyone can help, please do. Thanks in advance.

My DomainTag.hbm.xml file is as follows:

<?xml version="1.0"?>
<!DOCTYPE hibernate-mapping PUBLIC
    "-//Hibernate/Hibernate Mapping DTD 3.0//EN"
    "http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd">
<hibernate-mapping>
  <!-- Created by the Middlegen Hibernate plugin 2.2
       http://boss.bekk.no/boss/middlegen/
       http://www.hibernate.org/ -->
  <class name="com.test.manager.DomainTag" table="domaintag">
    <composite-id>
      <!-- Associations -->
      <!-- derived association(s) for compound key -->
      <!-- bi-directional many-to-one association to Item -->
      <key-many-to-one name="topic" class="com.test.manager.DomainTest">
        <column name="domain_id" />
      </key-many-to-one>
      <!-- bi-directional many-to-one association to Tag -->
      <key-many-to-one name="tag" class="com.test.manager.Tag">
        <column name="tag_id" />
      </key-many-to-one>
    </composite-id>
  </class>
</hibernate-mapping>

--
View this message in context: http://www.nabble.com/I-am-unable-to-create-index-of-an-object-having-composite-key-tp23211575p23211575.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
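For reference, the usual Hibernate Search (3.x) pattern for a composite key is to map the id with @DocumentId plus a custom two-way bridge, since the Lucene document id must round-trip to a single string. The sketch below is illustrative only - the DomainTagId class, its field layout, and the "_" encoding are assumptions based on the mapping above, not a verified answer:

{code}
import org.hibernate.search.annotations.DocumentId;
import org.hibernate.search.annotations.FieldBridge;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.bridge.TwoWayStringBridge;

// Hypothetical composite-id holder for the two key parts in the mapping.
class DomainTagId implements java.io.Serializable {
  private final long domainId, tagId;
  DomainTagId(long domainId, long tagId) { this.domainId = domainId; this.tagId = tagId; }
  long getDomainId() { return domainId; }
  long getTagId() { return tagId; }
}

// Illustrative bridge: encode/decode the two key parts into one string.
class DomainTagIdBridge implements TwoWayStringBridge {
  public String objectToString(Object object) {
    DomainTagId id = (DomainTagId) object;
    return id.getDomainId() + "_" + id.getTagId();
  }
  public Object stringToObject(String stringValue) {
    String[] parts = stringValue.split("_");
    return new DomainTagId(Long.parseLong(parts[0]), Long.parseLong(parts[1]));
  }
}

@Indexed
public class DomainTag {
  @DocumentId
  @FieldBridge(impl = DomainTagIdBridge.class)  // composite key -> one string
  private DomainTagId id;  // wraps the topic (domain_id) and tag (tag_id) parts
  // ... other mapped properties ...
}
{code}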
Re: Lucene 1483 and Auto resolution
Ah, right - that's basically what I was suggesting, though I was thinking that we might need to resolve twice, but of course you're right and we don't have to.

Lucene does still use FSHQ for back compat (for custom field sources, I think?), but I can just do the AUTO resolution in IndexSearcher only in non-legacy mode and restore AUTO resolution to FSHQ. That will avoid the harmless but wasteful double resolve in legacy mode. I had no code to look at when I sent that out.

--
- Mark
http://www.lucidimagination.com
Re: Lucene 1483 and Auto resolution
Eh - we have to spin through all the fields to check for legacy mode first anyway. Just doing it twice in legacy mode is probably best? I think there is no way to avoid spinning through the fields twice in any case (you have to spin through the sort fields to know you don't want to spin through the sort fields), so I guess I'll just go with that.

--
- Mark
http://www.lucidimagination.com
Re: Lucene 1483 and Auto resolution
I'm just putting the AUTO resolution back in FSHQ, and there is no double check - foolish me. As I first said, we will either resolve first in back-compat mode and it will be skipped in FSHQ, or it will be resolved in FSHQ for anyone using it externally (and not resolving first on their own). I'll commit the change shortly.

--
- Mark
http://www.lucidimagination.com
Re: I am unable to create index of an object having composite key
You'll do best to direct this question to the Hibernate group. java-dev is for Lucene development, so it's not an appropriate place to ask. java-user would be better, but your question is more Hibernate-specific.

Erik
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702343#action_12702343 ]

Shai Erera commented on LUCENE-1593:
------------------------------------

A few updates (it's been a while since I posted on this issue):

* I tried to pre-populate the queue in TFC, but that proved to be impossible (unless I missed something). The only reason to do it was to get rid of the 'if (queueFull)' check in every collect. However, it turned out that pre-populating the queue for TFC with sentinel values is unreliable. The reason is: if I want to get rid of that 'if', I don't want to add any 'if' to FieldComparator.compare, so the sentinel values would have to be something like MIN_VALUE or MAX_VALUE (depending on the value of 'reverse'). However, someone can set a field value to either of these values (and there are tests that do), so I need to check in FC whether the current value is a sentinel, which adds that 'if' back, only worse - it's now executed in every compare() call. Unless I missed something, I don't think that's possible, or at least worth the effort (to get rid of one 'if').
** BTW, in TSDC I use Float.NEG_INF as a sentinel value. This might be broken if a Scorer decided to return that value, in which case pre-populating the queue would not work either. I think it is still safe in TSDC, but I want to get your opinion.
* I changed Sort.setSort() to not add SortField.FIELD_DOC, as suggested by Mike. But then TestSort.testMultiSort failed. So I debugged it, and either the test works fine but there's a bug in MultiSearcher, or the test does not work fine (or should be adjusted), but then we'll be changing the runtime behavior of Sort (e.g., everyone who used setSort might get undesired behavior, but only in MultiSearcher).
** MultiSearcher's search(Weight, Filter, int, Sort) method executes a search against all its Searchables, sequentially. The test adds documents to two indexes, odd and even: the odd index gets two documents (A and E) with the value 5 in docs 0 and 2, and the even index gets one doc (B) with value 5 in doc 0. When I use the 2-SortField version (the 2nd sorts by doc), the output is ABE, since the compare-by-doc uses the searcher's doc Ids (0, 0, 2), and B is always less than E and so preferred, even though its 'fixed' doc Id is 7 (it appears in the 2nd Searcher). When I use the 1-SortField version, the result is AEB, since B's fixed doc Id is 7, and now the code breaks the tie on their *true* doc Ids. I hope you were able to follow my description. That's why I don't know which one is the true output, ABE or AEB. I tend to say AEB, since those are the true doc Ids the application will get in the end.

A few more comments:
# TestSort.runMultiSort contains 3 versions of this test:
#* One sorts by the INT value only, relying on the doc Id tie breaking. You can see the output was not determined to be exact, and so the pattern used is [ABE]{3}, as if the test's output is not predictable. I think that's wrong in the first place; we should always have predictable test output, since we're not involving any randomness in that code.
#* The second explicitly sets INT followed by DOC sorts, and expects [AB]{2}E.
#* The third relies on setSort adding DOC, so it expects the same [AB]{2}E.
# The problem in MultiSearcher is that it uses FieldDocSortedHitQueue, but doesn't update the DOC field's value the way it does the scoreDoc.doc value (adding the current Searcher's start).

Again, whatever the right output is, changing Sort to not include SortField.FIELD_DOC might mean someone experiences a change in behavior (if they use MultiSearcher), even if that behavior is buggy. What do you think?

Optimizations to TopScoreDocCollector and TopFieldCollector
-----------------------------------------------------------

Key: LUCENE-1593
URL: https://issues.apache.org/jira/browse/LUCENE-1593
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Fix For: 2.9

This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is:
# Ensure that IndexSearcher returns segments in increasing doc Id order, instead of numDocs().
# Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete.
# Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check if reusableSD == null.
# Also move to use changing top and then call adjustTop(), in case we update the queue.
# Some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker for the last SortField. But doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in).
# Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add an addDummyObjects method which will populate the queue without arranging it, just store the objects in the array (this can be used to pre-populate sentinel values)?

I will post a patch as well as some perf measurements as soon as I have them.
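For illustration, here is a minimal, self-contained sketch of the sentinel idea as it applies to TSDC. The class and method names are hypothetical (this is not Lucene's actual HitQueue): pre-filling the queue with entries that always lose removes the "is the queue full?" branch from collect(), subject to the Float.NEG_INF caveat raised above.

{code}
import java.util.Comparator;
import java.util.PriorityQueue;

final class SentinelQueueSketch {
  static final class ScoreDoc {
    final int doc; final float score;
    ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
  }

  // Head of the queue is the "bottom" (worst) entry: lowest score,
  // ties broken by treating the larger docID as worse.
  private final PriorityQueue<ScoreDoc> pq;

  SentinelQueueSketch(int size) {
    pq = new PriorityQueue<ScoreDoc>(size, new Comparator<ScoreDoc>() {
      public int compare(ScoreDoc a, ScoreDoc b) {
        if (a.score != b.score) return Float.compare(a.score, b.score);
        return b.doc - a.doc;  // docIDs are non-negative, so no overflow
      }
    });
    // Pre-populate with sentinels that any real hit beats -- unless a
    // Scorer really returns Float.NEGATIVE_INFINITY (the caveat above).
    for (int i = 0; i < size; i++) {
      pq.add(new ScoreDoc(Integer.MAX_VALUE, Float.NEGATIVE_INFINITY));
    }
  }

  // collect() with no queueFull check: docs arrive in increasing docID
  // order, so a tie with the bottom can never win and is simply rejected.
  void collect(int doc, float score) {
    if (score > pq.peek().score) {
      pq.poll();
      pq.add(new ScoreDoc(doc, score));
    }
  }
}
{code}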
[jira] Commented: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead
[ https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702379#action_12702379 ]

Yonik Seeley commented on LUCENE-1609:
--------------------------------------

Re: why the lazy load
http://www.lucidimagination.com/search/document/97e73361748808b/terminfosreader_lazy_term_index_reading#2a73aaca25d516ec

Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead
--------------------------------------------------------------------------------------------------

Key: LUCENE-1609
URL: https://issues.apache.org/jira/browse/LUCENE-1609
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 2.9
Environment: Solr, Tomcat 5.5, Ubuntu 2.6.20-17-generic, Intel(R) Pentium(R) 4 CPU 2.80GHz, 2Gb RAM
Reporter: Dan Rosher
Attachments: LUCENE-1609.patch

The synchronized method ensureIndexIsRead in TermInfosReader causes contention under heavy load. Simple to reproduce: e.g. under Solr, with all caches turned off, do a simple range search (e.g. id:[0 TO 99]) on even a small index (in my case 28K docs) under a load/stress test application; later, examining the thread dump (kill -3), many threads are blocked on 'waiting for monitor entry' for this method.

Rather than using Double-Checked Locking, which is known to have issues, this implementation uses a state pattern, where only one thread can move the object from the IndexNotRead state to IndexRead, and in doing so alters the object's behavior - i.e. once the index is loaded, the index no longer needs a synchronized method. In my particular test, this increased throughput at least 30 times.
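For illustration, a minimal sketch of the state pattern described in the issue (names like TermIndex and IndexState are hypothetical, not the patch's actual classes): only one thread performs the load, and once loaded, lookups do a single volatile read with no monitor acquisition.

{code}
interface IndexState { long[] index(); }

final class TermIndex {
  // volatile ensures readers see the fully constructed LoadedState
  private volatile IndexState state = new NotLoadedState();

  private final class NotLoadedState implements IndexState {
    public long[] index() {
      synchronized (TermIndex.this) {
        if (state == this) {          // only one thread performs the load
          state = new LoadedState(readIndexFromDisk());
        }
      }
      return state.index();           // delegate to the loaded state
    }
  }

  private static final class LoadedState implements IndexState {
    private final long[] data;
    LoadedState(long[] data) { this.data = data; }
    public long[] index() { return data; }  // no synchronization needed
  }

  // Hot path after loading: one volatile read plus a virtual call.
  long[] getIndex() { return state.index(); }

  private long[] readIndexFromDisk() {
    return new long[0];  // stands in for the expensive one-time load
  }
}
{code}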
[jira] Commented: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead
[ https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702404#action_12702404 ]

Michael McCandless commented on LUCENE-1609:
--------------------------------------------

If it's only for segment merging, couldn't we conditionalize the loading of the index up front?
Re: Lucene 1483 and Auto resolution
OK, that looks right. Though maybe you should add javadocs to FieldValueHitQueue saying it does *not* do AUTO resolution? (Since we've deprecated FSHQ in favor of FVHQ, I think we should call out this difference?) And state the workaround (calling FVHQ.detectFieldType yourself)?

Mike
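To make the workaround concrete, here is a hedged sketch of resolving AUTO sort fields yourself against the top-level reader before building the queue. The detectFieldType helper below stands in for whatever detection method is exposed (the thread calls it FVHQ.detectFieldType); its signature and the sample detection body are assumptions, not confirmed API.

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.SortField;

final class AutoResolveSketch {
  // Illustrative only: resolve SortField.AUTO entries up front, on the
  // top-level (multi-segment) reader, so the queue never sees AUTO.
  static SortField[] resolveAuto(IndexReader topReader, SortField[] fields)
      throws IOException {
    SortField[] resolved = new SortField[fields.length];
    for (int i = 0; i < fields.length; i++) {
      resolved[i] = (fields[i].getType() == SortField.AUTO)
          // hypothetical helper standing in for FVHQ.detectFieldType
          ? detectFieldType(topReader, fields[i].getField())
          : fields[i];
    }
    return resolved;
  }

  // Assumed detection: peek at the field's first indexed term and guess
  // INT/FLOAT/STRING -- roughly what AUTO resolution did internally.
  private static SortField detectFieldType(IndexReader reader, String field)
      throws IOException {
    // ... omitted: inspect the field's terms and choose a concrete type ...
    return new SortField(field, SortField.STRING);
  }
}
{code}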
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702417#action_12702417 ]

Michael McCandless commented on LUCENE-1593:
--------------------------------------------

bq. I tried to move to pre-populate the queue in TFC, but that proved to be impossible (unless I missed something)

I think it should work fine, for most types, because we'd set docID (the tie breaker) to Integer.MAX_VALUE. No special additional 'if' is then required, since that entry would always compare at the bottom? For String we should be able to use U+FFFF.

bq. So I debugged it and either the test works fine but there's a bug in MultiSearcher, or the test does not work fine (or should be adjusted) but then we'll be changing runtime behavior of Sort (e.g., everyone who used setSort might get undesired behavior, only in MultiSearcher).

Hmm - good catch! I think this is in fact a bug in MultiSearcher (AEB is the right answer): it's failing to break ties (by docID) properly. I.e., it will not match what IndexSearcher(MultiSegmentReader(...)) will do. I think we could fix this by allowing one to pass in a doc base when searching? Though that's a more involved change... could you open a new issue for this one?

bq. I think that's wrong in the first place, we should always have predicted tests output, since we're not involving any randomness in that code.

I agree - let's fix it to be a deterministic test?
Re: Lucene 1483 and Auto resolution
Will do.

--
- Mark
http://www.lucidimagination.com
Re: Fuzzy search optimization
Please do!

Mike

On Thu, Apr 23, 2009 at 7:13 AM, Varun Dhussa va...@mapmyindia.com wrote:
> Hi,
>
> I was going through the Levenshtein distance code in org.apache.lucene.search.FuzzyTermEnum.java of the 2.4.1 build. I noticed that there can be a small but effective optimization to the distance calculation code (initialization). I have the code ready with me. I can post it if anyone is interested.
>
> Thanks and regards
> Varun Dhussa
> Product Architect
> CE InfoSystems (P) Ltd.
> http://maps.mapmyindia.com
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702418#action_12702418 ]

Michael McCandless commented on LUCENE-1593:
--------------------------------------------

bq. Actually, I think BooleanScorer need not process docs out of order? The only out-of-orderness seems to come from how it appends each new Bucket to the head of the linked list; if instead it appended to the tail, the collector would see docs arrive in order. I think?

I was wrong about this - you also get out-of-orderness due to the 2nd, 3rd, etc. clauses in the boolean query adding in new docs. Re-ordering those will be costly.

But: in the initial collection of each chunk, we do know that any docID in the queue will be less than those being visited, so this could be a good future optimization for immediately ruling out docs during TermScorer. That's a larger change... e.g. it'd require being able to say "I don't need an exact total hit count for this search."
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702436#action_12702436 ]

Michael McCandless commented on LUCENE-1539:
--------------------------------------------

{quote}
Yeah? Ok. So the deleteDocsByPercent method needs to somehow take into account whether it's deleted before, by adjusting the doc nums it's deleting?
{quote}

How about randomly choosing docs to delete instead of every N? Then you don't need to keep track.

{quote}
bq. I don't think we can relax that. This (single transaction (writer) open at once) is a core assumption in Lucene.

True, however that doesn't mean we have to stick with it, especially internally. Hopefully we can move to a more componentized model where someone could change this if they wanted. Perhaps in the flexible indexing revamp.
{quote}

We'd need to figure out how to get multiple writers to properly cooperate. Actually, Marvin is working on something like this (for KS/Lucy), where one lightweight writer can do adds/deletes/small merges, and a separate heavyweight writer does large merges.

Improve Benchmark
-----------------

Key: LUCENE-1539
URL: https://issues.apache.org/jira/browse/LUCENE-1539
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/benchmark
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
Original Estimate: 336h
Remaining Estimate: 336h

Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java.
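For illustration, a small sketch of the "randomly choose docs to delete" suggestion above (illustrative only, not the benchmark's actual task code). Random selection needs no memory of which docs earlier rounds deleted; it uses the pre-2.9 IndexReader delete API:

{code}
import java.io.IOException;
import java.util.Random;
import org.apache.lucene.index.IndexReader;

final class RandomDeleteSketch {
  // Assumes the index still has at least pct% live docs; otherwise the
  // loop would spin looking for undeleted candidates.
  static void deleteDocsByPercent(IndexReader reader, double pct, Random random)
      throws IOException {
    final int maxDoc = reader.maxDoc();
    int remaining = (int) (maxDoc * (pct / 100.0));
    while (remaining > 0) {
      int doc = random.nextInt(maxDoc);
      if (!reader.isDeleted(doc)) {   // skip already-deleted docs
        reader.deleteDocument(doc);
        remaining--;
      }
    }
  }
}
{code}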
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702448#action_12702448 ]

Shai Erera commented on LUCENE-1593:
------------------------------------

bq. I think it should work fine, for most types, because we'd set docID (the tie breaker) to Integer.MAX_VALUE. No special additional if is then required, since that entry would always compare at the bottom?

That's not so simple. Let's say I initialize the sentinel of IntComp to Integer.MAX_VALUE. That should have guaranteed that any 'int' < MAX_VAL would compare better. But the code in TFC compares the current doc against the 'bottom'. For all sentinels, that means MAX_VAL. If the input doc's val < MAX_VAL, it compares better. Otherwise, it is rejected, because:
# If it is bigger than the bottom, it should be rejected.
# If it equals, it's also rejected, since now that we move to returning docs in order, it is assumed that this doc's doc Id is greater than whatever is in the queue, and so it's rejected. Actually, the tie is broken only after it's in the queue, when the latter calls compare(). This assumption removed the 'if' that checked for the doc Id value, so if I reinstate it, we don't really gain anything, right (replacing 'if (queueFull)' with 'if (bottom.doc > doc + docBase)')?

bq. For String we should be able to use U+FFFF.

If we resolve the core issues with sentinel values, then this is the value I'd use for Strings, right.

bq. I think we could fix this by allowing one to pass in a docbase when searching?

I actually would like to propose the following: MultiSearcher already fixes the FD.doc before inserting it into the queue. I can do the same for the FieldDoc.fields() value, in case one of the fields is FIELD_DOC. This can happen just after searcher.search() returns, and before MS adds the results to its own FieldDocSortedHitQueue. I already did it, and all the testMultiSearch cases fail, but that's because they just assert that the bug exists :). If you think a separate issue is still required, I can do it, but that would mean that the tests will fail until we fix it, or I don't change Sort in this issue and do it as part of the other one.

bq. let's fix it to be a deterministic test?

Will do, but it depends - if a new issue is required, I'll do it there.

bq. I was wrong about this

I must say I still don't fully understand what you mean here. I intended to leave that until after everything else in that issue's scope works, and note whether any tests fail or BQ actually behaves properly. So I'll simply take what you say as true :), since I'm not familiar with that code.
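To illustrate the rejection path Shai describes, here is a simplified rendering of the in-order collect logic being discussed. The field names follow TopFieldCollector's real API, but the body is a sketch (the real code also multiplies compareBottom's result by reverseMul and handles multiple comparators):

{code}
import java.io.IOException;
import org.apache.lucene.search.FieldComparator;

final class InOrderCollectSketch {
  FieldComparator comparator;  // set up elsewhere; bottom already tracked
  int bottomSlot;

  void collect(int doc) throws IOException {
    // Positive means the incoming doc beats the bottom entry (simplified
    // sign convention; the real code applies reverseMul here).
    if (comparator.compareBottom(doc) <= 0) {
      // Worse than bottom, OR tied with it: rejected, because an in-order
      // incoming doc always has a larger docID than anything in the queue.
      // With an Integer.MAX_VALUE sentinel as the bottom *value*, a real doc
      // whose field value equals MAX_VALUE ties here and is wrongly rejected
      // -- which is why the TSDC sentinel trick doesn't carry over to TFC.
      return;
    }
    comparator.copy(bottomSlot, doc);  // replace the bottom entry
    updateBottom(doc);                 // re-heapify / track the new bottom
  }

  private void updateBottom(int doc) { /* adjust queue top; omitted */ }
}
{code}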
[jira] Commented: (LUCENE-1516) Integrate IndexReader with IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702461#action_12702461 ]

Michael McCandless commented on LUCENE-1516:
--------------------------------------------

bq. We need an IndexWriter.getMergedSegmentWarmer method?

Yes, I just committed; thanks!

Integrate IndexReader with IndexWriter
--------------------------------------

Key: LUCENE-1516
URL: https://issues.apache.org/jira/browse/LUCENE-1516
Project: Lucene - Java
Issue Type: Improvement
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, magnetic.png, ssd.png, ssd2.png
Original Estimate: 672h
Remaining Estimate: 672h

The current problem is that an IndexReader and IndexWriter cannot be open at the same time and both perform updates, as they both require the write lock to the index. While methods such as IW.deleteDocuments enable deleting from IW, methods such as IR.deleteDocument(int doc) and norms updating are not available from IW. This limits the capability of performing updates to the index dynamically or in realtime without closing the IW and opening an IR, deleting or updating norms, flushing, then opening the IW again - a process which can be detrimental to realtime updates.

This patch will expose an IndexWriter.getReader method that returns the currently flushed state of the index as a class that implements IndexReader. The new IR implementation will differ from existing IR implementations such as MultiSegmentReader in that flushing will synchronize updates with IW, in part by sharing the write lock. All methods of IR will be usable, including reopen and clone.
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702471#action_12702471 ]

Uwe Schindler commented on LUCENE-1575:
---------------------------------------

Hi,

Shalin found a backwards-incompatible change in the Searcher abstract class; I noticed this from his SVN comment for SOLR-940 (where he updated to Lucene trunk):

{code}
abstract public void search(Weight weight, Filter filter, Collector results) throws IOException;
{code}

This should not be abstract, for backwards compatibility, but should instead throw an UnsupportedOperationException or have a default implementation that somehow wraps the Collector using an old HitCollector (not very nice; I do not know how to fix this in any other way). Before 3.0, where this change would be OK, the Javadocs should note that the deprecated HitCollector API will be removed and the Collector variant will be made abstract. If this method stays abstract, you cannot compile old code or swap in the new Lucene jars (this is seldom an issue, as almost nobody creates private implementations of Searcher, but Solr does...).

Refactoring Lucene collectors (HitCollector and extensions)
-----------------------------------------------------------

Key: LUCENE-1575
URL: https://issues.apache.org/jira/browse/LUCENE-1575
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 2.9
Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.8.patch, LUCENE-1575.9.patch, LUCENE-1575.9.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, PerfTest.java, sortBench5.py, sortCollate5.py

This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].

We have agreed to do the following refactoring:
* Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
* Deprecate HitCollector in favor of the new Collector.
* Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
** It will remove any instanceof checks that currently exist in IndexSearcher code.
* Create a new (abstract) TopDocsCollector, which will:
** Leave collect and setNextReader unimplemented.
** Introduce protected members PriorityQueue and totalHits.
** Introduce a single protected constructor which accepts a PriorityQueue.
** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
** Introduce a new topDocs(start, howMany) method which will be used as a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
* Change TopScoreDocCollector to extend TopDocsCollector, using the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
* Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
* Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany).
* Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.

Additionally, the following proposal was made w.r.t. decoupling score from collect():
* Change collect to accept only a doc Id (unbased).
* Introduce a setScorer(Scorer) method.
* If during collect the implementation needs the score, it can call scorer.score().

If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises a few questions:
* What if during collect() Scorer is null? (i.e., not set) - is it even possible?
* I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that the score is needed in collect() always?

Open issues:
* The name for Collector.
* TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Collector ResultsCollector. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
* Decoupling score from collect().

I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch, (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method). There might even be a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
[jira] Reopened: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler reopened LUCENE-1575:
-----------------------------------
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702488#action_12702488 ]

Shai Erera commented on LUCENE-1575:
------------------------------------

If you read somewhere above, you'll see that we've discussed this change. It seems that whatever we do, anyone who upgrades to 2.9 will need to touch their code. If you extend Searcher, you'll need to override that new method regardless of what we choose: if it's abstract, you need to implement it, and if it's concrete (throwing UOE), you'll need to override it anyway, since all the other Searcher methods call this one in the end. I'm not sure wrapping a Collector with a HitCollector will work, because all of the other classes now expect Collector, and their HitCollector variants call the Collector one through a HitCollectorWrapper (which is also deprecated). We agreed that making it abstract is the lesser of all evils, since you'll spot it at compile time, rather than at runtime when you hit a UOE.
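To make the trade-off concrete, here is a sketch of the two options discussed. The signature is the one quoted from trunk above; the commented-out body is the rejected alternative, written out for illustration:

{code}
import java.io.IOException;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Weight;

// Sketch only; bodies are illustrative, not the actual Searcher source.
public abstract class Searcher {
  // Option chosen: abstract. Old subclasses fail to *compile* against 2.9,
  // surfacing the break immediately.
  abstract public void search(Weight weight, Filter filter, Collector results)
      throws IOException;

  // Rejected option: a concrete default that throws. Old jars keep linking,
  // but every search path funnels into this method, so existing subclasses
  // would only discover the break via a runtime exception:
  //
  // public void search(Weight weight, Filter filter, Collector results)
  //     throws IOException {
  //   throw new UnsupportedOperationException(
  //       "subclasses must override search(Weight, Filter, Collector)");
  // }
}
{code}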
Another possible optimization - now in DocIdSetIterator
Hi

I think we can make some optimizations to DocIdSetIterator. Today, it defines next() and skipTo(int), which return a boolean. I've checked the code, and it looks like almost always when these two are called, they are followed by a call to doc(). I was thinking that if those two returned the doc Id they are at, instead of a boolean, that would save the call to doc(). Those that use them can:
* Compare the doc to a NO_MORE_DOCS constant (set to -1), to understand there are no more docs in this iterator.
* If skipTo() is called, compare the 'target' to the returned Id, and if they are not the same, save it, so that the next time a skipTo is requested, they don't perform it if the returned Id is greater than the target. If it's not possible to save it, they can call doc() to get that information.

The way I see it, the impls that will still need to call doc() lose nothing. All we'll do is change the 'if' from comparing a boolean to comparing ints (even though that's a bit less efficient than comparing booleans). The impls that call doc() just because all they have in hand is a boolean will gain.

Obviously we can't change those methods' signatures, so we can deprecate them and introduce nextDoc() and skipToDoc(int target). We should still keep doc() around, though.

What do you think? If you agree to this change, I can add it to 1593 or create a new issue (I prefer the latter, so that 1593 stays focused on the changes to Collectors).

Shai
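For concreteness, a sketch of the proposed API change. The method names and the -1 sentinel are exactly as proposed above; the default bodies bridging to the deprecated methods are illustrative:

{code}
import java.io.IOException;

public abstract class DocIdSetIterator {
  /** Sentinel returned when iteration is exhausted (as proposed: -1). */
  public static final int NO_MORE_DOCS = -1;

  /** @deprecated use {@link #nextDoc()} */
  public abstract boolean next() throws IOException;

  /** @deprecated use {@link #skipToDoc(int)} */
  public abstract boolean skipTo(int target) throws IOException;

  public abstract int doc();

  /** Advances and returns the current doc, or NO_MORE_DOCS. Saves the
   *  extra doc() call that nearly every caller of next() makes today. */
  public int nextDoc() throws IOException {
    return next() ? doc() : NO_MORE_DOCS;
  }

  /** Skips to the first doc beyond target and returns it, or NO_MORE_DOCS.
   *  Callers can compare the return value to target instead of calling
   *  doc() again. */
  public int skipToDoc(int target) throws IOException {
    return skipTo(target) ? doc() : NO_MORE_DOCS;
  }
}
{code}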
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702496#action_12702496 ] Jason Rutherglen commented on LUCENE-1313: -- For this patch I'm debating whether to add a package-protected IndexWriter.addIndexWriter method. The problem is, the RAMIndex blocks on the write to disk during IW.addIndexesNoOptimize, which, if we're using ConcurrentMergeScheduler, shouldn't happen? Meaning in this proposed solution, if segments keep on piling up in RAMIndex, we simply move them over to the disk IW, which will in the background take care of merging them away and to disk. I don't think it's necessary to immediately write ram segments to disk (like the current patch does); instead, it's possible to simply copy segments over from the incoming IW, leave them in RAM, and they can be merged to disk as necessary? Then on IW.flush any segmentinfo(s) that are not from the current directory can be flushed to disk? Just thinking out loud about this. Realtime Search --- Key: LUCENE-1313 URL: https://issues.apache.org/jira/browse/LUCENE-1313 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch Realtime search with transactional semantics. Possible future directions: * Optimistic concurrency * Replication Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot. I think this issue can hold realtime benchmarks which include indexing and searching concurrently. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Another possible optimization - now in DocIdSetIterator
Shai Erera wrote: Hi I think we can make some optimization to DocIdSetIterator. Today, it defines next() and skipTo(int) which return a boolean. I've checked the code and it looks like almost always when these two are called, they are followed by a call to doc(). I was thinking that if those two returned the doc Id they are at, instead of boolean, that will save the call to doc(). Those that use these can: * Compare doc to a NO_MORE_DOCS constant (set to -1), to understand there are no more docs in this iterator. * If skipTo() is called, compare the 'target' to the returned Id, and if they are not the same, save it so that the next skipTo is requested, they don't perform it if the returned Id is greater than the target. If it's not possible to save it, they can call doc() to get that information. The way I see it, the impls that will still need to call doc() will lose nothing. All we'll do is change the 'if' from comparing a boolean to comparing ints (even though that's a bit less efficient than comparing booleans). The impls that call doc() just because all they have in hand is a boolean, will gain. Obviously we can't change those methods' signature, so we can deprecate them and intorudce nextDoc() and skipToDoc(int target). We should still keep doc() around though. What do you think? If you agree to this change, I can add them to 1593, or create a new issue (I prefer the latter so that 1593 will be focused on the changes to Collectors). Shai Any micro benchmarks or anything? If it's a net real-world win, +1. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Another possible optimization - now in DocIdSetIterator
Mark Miller wrote: Shai Erera wrote: Hi I think we can make some optimization to DocIdSetIterator. Today, it defines next() and skipTo(int) which return a boolean. I've checked the code and it looks like almost always when these two are called, they are followed by a call to doc(). I was thinking that if those two returned the doc Id they are at, instead of boolean, that will save the call to doc(). Those that use these can: * Compare doc to a NO_MORE_DOCS constant (set to -1), to understand there are no more docs in this iterator. * If skipTo() is called, compare the 'target' to the returned Id, and if they are not the same, save it so that the next skipTo is requested, they don't perform it if the returned Id is greater than the target. If it's not possible to save it, they can call doc() to get that information. The way I see it, the impls that will still need to call doc() will lose nothing. All we'll do is change the 'if' from comparing a boolean to comparing ints (even though that's a bit less efficient than comparing booleans). The impls that call doc() just because all they have in hand is a boolean, will gain. Obviously we can't change those methods' signature, so we can deprecate them and intorudce nextDoc() and skipToDoc(int target). We should still keep doc() around though. What do you think? If you agree to this change, I can add them to 1593, or create a new issue (I prefer the latter so that 1593 will be focused on the changes to Collectors). Shai any micro benchmarks or anything? If its a net real world win, +1. P.S. I'd make a new issue myself. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702503#action_12702503 ] Michael McCandless commented on LUCENE-1593: bq. If it equals, it's also rejected, since now that we move to returning docs in order, it is assumed that this doc's doc Id is greater than whatever is in the queue, and so it's rejected. Argh, you are right... so the approach will fail if any of the first topN (eg 10) hits have a field value equal to the sentinel value. I guess we could do two separate passes: a "startup" pass (while the queue is filling up) and a "the rest" pass that knows the queue is full. But that's getting rather ugly; probably we should leave this optimization to source code specialization. bq. I can do the same for the FieldDoc.fields() value, in case one of the fields is FIELD_DOC. Excellent! Let's just do this as part of this issue. bq. I must say I still didn't fully understand what do you mean here. I also did not understand how BooleanScorer works until Doug explained it (see the comment at the top). Right now it gathers hits of each clause in a reversed linked list, which it then makes a 2nd pass to collect. So the Collector will see docIDs in reverse order for that clause. I thought we could simply fix the linking to be forward and we'd have docIDs in order. But that isn't quite right because any new docIDs hit by the 2nd clause will be inserted at the end of the linked list, out of order. Optimizations to TopScoreDocCollector and TopFieldCollector --- Key: LUCENE-1593 URL: https://issues.apache.org/jira/browse/LUCENE-1593 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is: # Ensure that IndexSearcher returns segments in increasing doc Id order, instead of numDocs(). # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete. # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the "reusableSD == null" check. # Also move to changing the top and then calling adjustTop(), in case we update the queue. # Some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker for the last SortField. But, doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in). # Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add an addDummyObjects method which will populate the queue without arranging it, just storing the objects in the array (this can be used to pre-populate sentinel values)? I will post a patch as well as some perf measurements as soon as I have them. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
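A sketch of the collect() fast path that items 2-4 of the plan aim for; the field names here (scorer, pq, pqTop, docBase, totalHits) are illustrative, not the exact patch:

import java.io.IOException;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.util.PriorityQueue;

// Illustrative collect() under the sentinel scheme: the queue is
// pre-filled with ScoreDocs scored Float.NEGATIVE_INFINITY, so there is
// no "queue not full yet" branch, and because segments arrive in
// increasing docID order a tie can simply be rejected.
class SentinelCollectSketch {
  Scorer scorer;        // set via setScorer()
  PriorityQueue pq;     // pre-filled with sentinel ScoreDocs
  ScoreDoc pqTop;       // cached pq.top()
  int docBase;          // set via setNextReader()
  int totalHits;

  public void collect(int doc) throws IOException {
    float score = scorer.score();
    totalHits++;
    if (score <= pqTop.score) {
      // Rejected -- including the equals case, which is the edge
      // discussed above when a real hit matches the sentinel value.
      return;
    }
    pqTop.doc = doc + docBase;
    pqTop.score = score;
    pq.adjustTop();               // re-heapify after mutating the top in place
    pqTop = (ScoreDoc) pq.top();
  }
}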
Re: Another possible optimization - now in DocIdSetIterator
I think this is a good idea! I think a new issue is best. Mike On Fri, Apr 24, 2009 at 3:26 PM, Shai Erera ser...@gmail.com wrote: Hi I think we can make some optimization to DocIdSetIterator. Today, it defines next() and skipTo(int) which return a boolean. I've checked the code and it looks like almost always when these two are called, they are followed by a call to doc(). I was thinking that if those two returned the doc Id they are at, instead of boolean, that will save the call to doc(). Those that use these can: * Compare doc to a NO_MORE_DOCS constant (set to -1), to understand there are no more docs in this iterator. * If skipTo() is called, compare the 'target' to the returned Id, and if they are not the same, save it so that the next skipTo is requested, they don't perform it if the returned Id is greater than the target. If it's not possible to save it, they can call doc() to get that information. The way I see it, the impls that will still need to call doc() will lose nothing. All we'll do is change the 'if' from comparing a boolean to comparing ints (even though that's a bit less efficient than comparing booleans). The impls that call doc() just because all they have in hand is a boolean, will gain. Obviously we can't change those methods' signature, so we can deprecate them and intorudce nextDoc() and skipToDoc(int target). We should still keep doc() around though. What do you think? If you agree to this change, I can add them to 1593, or create a new issue (I prefer the latter so that 1593 will be focused on the changes to Collectors). Shai - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene 2.9 status (to port to Lucene.Net)
George, did you mean LUCENE-1516 below? (LUCENE-1313 is a further improvement to near real-time search that's still being iterated on). In general I would say 2.9 seems to be in rather active development still ;) I too would love to hear about production/beta use of 2.9. George maybe you should re-ask on java-user? Mike On Sat, Apr 18, 2009 at 7:12 PM, George Aroush geo...@aroush.net wrote: Thanks all for your input on this subject. So, if I decide to grab the current code off the trunk, is it: 1) Usable for production use? 2) Is LUCENE-1313 (Realtime search), in the current trunk, stable and ready for use? Put another way, is anyone using the current trunk code in production, or even as beta? -- George From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] Sent: Thursday, April 16, 2009 5:13 PM To: java-dev@lucene.apache.org Subject: Re: Lucene 2.9 status (to port to Lucene.Net) LUCENE-1313 relies on LUCENE-1516 which is in trunk. If you have other questions George, feel free to ask. On Thu, Apr 16, 2009 at 8:04 AM, George Aroush geo...@aroush.net wrote: Thanks Mike. A quick follow up question. What's the status of http://issues.apache.org/jira/browse/LUCENE-1313? Can this work be applied to Lucene 2.4.1 and still get its benefit, or are there other dependencies / issues with it that prevent us from doing so? If anyone else knows, I welcome your input. -- George -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Thursday, April 16, 2009 8:36 AM To: java-dev@lucene.apache.org Subject: Re: Lucene 2.9 status (to port to Lucene.Net) Hi George, There's been a sudden burst of activity lately on 2.9 development... I know there are some biggish remaining features we may want to get into 2.9: * The new field cache (LUCENE-831; still being iterated/mulled), * Possible major rework of Field / Document index-time vs search-time Document * Applying filters via random-access API when possible and performant (LUCENE-1536) * Possible further optimizations to how collection works (LUCENE-1593) * Maybe breaking core + contrib into a more uniform set of modules (and figuring out how Trie(Numeric)RangeQuery/Filter fits in here) -- the Modularization uber-thread. * Further improvements to near-realtime search (using RAMDir for small recently flushed segments) * Many other small things and probably some big ones that I'm forgetting now :) So things are still in flux, and I'm really not sure on a release date at this point. Late last year, I was hoping for early this year, but it's no longer early this year ;) Mike On Wed, Apr 15, 2009 at 9:17 PM, George Aroush geo...@aroush.net wrote: Hi Folks, This is George Aroush, I'm one of the committers on Lucene.Net - a port of Java Lucene to C# Lucene. I'm looking at the current trunk code of yet to be released Lucene 2.9 and I would like to port it to Lucene.Net. If I do this now, we get the benefit of keeping our code base and release dates much closer to Java Lucene. However, this comes with a cost of carrying over unfinished work, known defects, and I have to keep an eye on new code that get committed into Java Lucene which must be ported over in a timely fashion. 
To help me determine when is a good time to start the port -- keep in mind, I will be taking the latest code off SVN -- I'd like to hear from the Java Lucene committers (and users who are playing or using Lucene 2.9 off SVN) about these questions: 1) how stable the current code in the trunk is, 2) do you still have feature work to deliver or just bug fixes, and 3) what's your target date to release Java Lucene 2.9? #1 is important, as in: is anyone using it in production? Yes, I did look at the current open issues in JIRA, but that doesn't help me answer the above questions. Regards, -- George - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702509#action_12702509 ] Michael McCandless commented on LUCENE-1575: bq. Shalin found a backwards-incompatible change in the Searcher abstract class We could go either way on this... the evils were strong with either choice, and we struggled and eventually went with adding the abstract method today, for the reasons Shai enumerated. Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.8.patch, LUCENE-1575.9.patch, LUCENE-1575.9.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, PerfTest.java, sortBench5.py, sortCollate5.py This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-is by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used as a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accept only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). 
If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises a few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Collector ResultsCollector. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method. There might even be a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702515#action_12702515 ] Michael McCandless commented on LUCENE-1313: bq. I don't think it's necessary to immediately write ram segments to disk I agree: it should be fine from IndexWriter's standpoint if some segments live in a private RAMDir and others live in the real dir. In fact, early versions of LUCENE-843 did exactly this: IW's RAM buffer is not as efficient as a written segment, and so you can gain some RAM efficiency by flushing first to RAM and then merging to disk. I think we could adopt a simple criterion: you flush the new segment to the RAM Dir if net RAM used is < maxRamBufferSizeMB. This way no further configuration is needed. On auto-flush triggering you then must take into account the RAM usage by this RAM Dir. On commit, these RAM segments must be migrated to the real dir (preferably by forcing a merge, somehow). A near realtime reader would also happily mix real Dir and RAMDir SegmentReaders. This should work well I think, and should not require a separate RAMIndex class, and won't block things when the RAM segments are migrated to disk by CMS. Realtime Search --- Key: LUCENE-1313 URL: https://issues.apache.org/jira/browse/LUCENE-1313 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch Realtime search with transactional semantics. Possible future directions: * Optimistic concurrency * Replication Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot. I think this issue can hold realtime benchmarks which include indexing and searching concurrently. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
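For reference, the user-visible near-real-time flow this plugs into is the getReader() API already on trunk (LUCENE-1516); whether a freshly flushed segment lives in the private RAMDir or the real dir would stay an internal detail. A rough usage sketch:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

class NrtSketch {
  // getReader() flushes pending docs and returns a reader over the
  // newest segments without a commit; under the scheme above, some of
  // those segments would transparently be RAM-resident.
  static void nrtCycle(IndexWriter writer) throws IOException {
    IndexReader r = writer.getReader();  // sees uncommitted, flushed docs
    try {
      // ... search against r ...
    } finally {
      r.close();
    }
    writer.commit();  // per the discussion above, this is where the
                      // RAM-resident segments would be merged/migrated
                      // to the real directory
  }
}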
[jira] Resolved: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-1575. --- Resolution: Fixed OK, I resolve the issue again. I was just wondering and wanted to be sure. I also missed the first entry in CHANGES.txt, which explains it. It is the same as with the Fieldable interface in the past: it is seldom implemented/overridden, so the normal user will not be affected. And those who implement Fieldable or extend Searcher must implement it. Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.8.patch, LUCENE-1575.9.patch, LUCENE-1575.9.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, PerfTest.java, sortBench5.py, sortCollate5.py This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-is by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used as a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accept only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). 
If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises a few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Collector ResultsCollector. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method. There might even be a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?) -- This message is automatically generated by JIRA.
RE: Another possible optimization - now in DocIdSetIterator
Maybe combine this with the isRandomAccess change in DocIdSet? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Friday, April 24, 2009 9:46 PM To: java-dev@lucene.apache.org Subject: Re: Another possible optimization - now in DocIdSetIterator I think this is a good idea! I think a new issue is best. Mike On Fri, Apr 24, 2009 at 3:26 PM, Shai Erera ser...@gmail.com wrote: Hi I think we can make some optimization to DocIdSetIterator. Today, it defines next() and skipTo(int) which return a boolean. I've checked the code and it looks like almost always when these two are called, they are followed by a call to doc(). I was thinking that if those two returned the doc Id they are at, instead of boolean, that will save the call to doc(). Those that use these can: * Compare doc to a NO_MORE_DOCS constant (set to -1), to understand there are no more docs in this iterator. * If skipTo() is called, compare the 'target' to the returned Id, and if they are not the same, save it so that the next skipTo is requested, they don't perform it if the returned Id is greater than the target. If it's not possible to save it, they can call doc() to get that information. The way I see it, the impls that will still need to call doc() will lose nothing. All we'll do is change the 'if' from comparing a boolean to comparing ints (even though that's a bit less efficient than comparing booleans). The impls that call doc() just because all they have in hand is a boolean, will gain. Obviously we can't change those methods' signature, so we can deprecate them and intorudce nextDoc() and skipToDoc(int target). We should still keep doc() around though. What do you think? If you agree to this change, I can add them to 1593, or create a new issue (I prefer the latter so that 1593 will be focused on the changes to Collectors). Shai - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1604) Stop creating huge arrays to represent the absense of field norms
[ https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shon Vella updated LUCENE-1604: --- Attachment: LUCENE-1604.patch Updated patch that preserves the disableNorms flag across clone and reopen() and applies the flag transitively to MultiSegmentReader. Stop creating huge arrays to represent the absense of field norms - Key: LUCENE-1604 URL: https://issues.apache.org/jira/browse/LUCENE-1604 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.9 Reporter: Shon Vella Priority: Minor Fix For: 2.9 Attachments: LUCENE-1604.patch, LUCENE-1604.patch, LUCENE-1604.patch Creating and keeping around huge arrays that hold a constant value is very inefficient both from a heap usage standpoint and from a locality of reference standpoint. It would be much more efficient to use null to represent a missing norms table. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1252) Avoid using positions when not all required terms are present
[ https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702520#action_12702520 ] Michael McCandless commented on LUCENE-1252: Right, I think this is more about determining whether a doc is a hit or not, than about how to compute its score. I think somehow the scorer needs to return 2 scorers that share the underlying iterators. The first scorer simply checks AND-ness with all other required terms, and only if the doc passes those are the positional/payloads/anything-else-expensive consulted. Avoid using positions when not all required terms are present - Key: LUCENE-1252 URL: https://issues.apache.org/jira/browse/LUCENE-1252 Project: Lucene - Java Issue Type: Wish Components: Search Reporter: Paul Elschot Priority: Minor In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present. This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like "all required terms are present, but no position info was checked yet", and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
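One possible shape for the two scorers sharing the underlying iterators, purely as a sketch; the names and interface below are invented for illustration and exist nowhere in 2.9:

import java.io.IOException;

// Invented two-phase shape: a cheap phase that only advances on docs
// where all required terms co-occur, and an expensive phase that
// consults positions/payloads only for those candidates.
abstract class TwoPhaseScorerSketch {
  // Cheap phase: next doc where all required terms are present;
  // returns the docID, or -1 when exhausted. No positions read yet.
  abstract int nextCandidate() throws IOException;

  // Expensive phase: now check positions/payloads for the current doc.
  abstract boolean confirmPositions() throws IOException;

  // Driver: positional data is touched only for surviving candidates.
  void collectMatches(MatchSink sink) throws IOException {
    int doc;
    while ((doc = nextCandidate()) != -1) {
      if (confirmPositions()) {
        sink.match(doc);
      }
    }
  }

  interface MatchSink {
    void match(int doc);
  }
}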
Re: Another possible optimization - now in DocIdSetIterator
On the "it touches the same code" criterion, I would agree. On the "it's the same core problem" criterion, I would disagree. Also I would think this change is simpler than the isRandomAccess addition and so probably would land before isRandomAccess... so I think I'd lean towards keeping them as separate issues. Mike On Fri, Apr 24, 2009 at 4:06 PM, Uwe Schindler u...@thetaphi.de wrote: Maybe combine this with the isRandomAccess change it DocIdSet? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Friday, April 24, 2009 9:46 PM To: java-dev@lucene.apache.org Subject: Re: Another possible optimization - now in DocIdSetIterator I think this is a good idea! I think a new issue is best. Mike On Fri, Apr 24, 2009 at 3:26 PM, Shai Erera ser...@gmail.com wrote: Hi I think we can make some optimization to DocIdSetIterator. Today, it defines next() and skipTo(int) which return a boolean. I've checked the code and it looks like almost always when these two are called, they are followed by a call to doc(). I was thinking that if those two returned the doc Id they are at, instead of boolean, that will save the call to doc(). Those that use these can: * Compare doc to a NO_MORE_DOCS constant (set to -1), to understand there are no more docs in this iterator. * If skipTo() is called, compare the 'target' to the returned Id, and if they are not the same, save it so that the next skipTo is requested, they don't perform it if the returned Id is greater than the target. If it's not possible to save it, they can call doc() to get that information. The way I see it, the impls that will still need to call doc() will lose nothing. All we'll do is change the 'if' from comparing a boolean to comparing ints (even though that's a bit less efficient than comparing booleans). The impls that call doc() just because all they have in hand is a boolean, will gain. Obviously we can't change those methods' signature, so we can deprecate them and intorudce nextDoc() and skipToDoc(int target). We should still keep doc() around though. What do you think? If you agree to this change, I can add them to 1593, or create a new issue (I prefer the latter so that 1593 will be focused on the changes to Collectors). Shai - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: Lucene 2.9 status (to port to Lucene.Net)
George, did you mean LUCENE-1516 below? (LUCENE-1313 is a further improvement to near real-time search that's still being iterated on). In general I would say 2.9 seems to be in rather active development still ;) I too would love to hear about production/beta use of 2.9. George maybe you should re-ask on java-user? Here! I updated www.pangaea.de to Lucene-trunk today (because of the incomplete hashcode in TrieRangeQuery)... Works perfectly, but I do not use the realtime parts. And 10 days before, the same: no problems :-) Currently I am rewriting parts of my code to Collector, to move away from HitCollector (without score, so the optimizations apply)! The reopen() and sorting are fine; almost no time is consumed for sorted searches after reopening indexes every 20 minutes with just some new and small segments with changed documents. No extra warming is needed. Another change to be done here is to drop Field.Store.COMPRESS and replace it with manually compressed binary stored fields, but this is only to get rid of the deprecation warnings. But this cannot be done without complete reindexing. Uwe - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
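The refresh pattern described above, roughly; reopen() reuses unchanged segment readers (and their field caches), which is why sorted search needs almost no warming after small incremental updates. A minimal sketch:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

class PeriodicRefresh {
  private IndexReader reader;
  private IndexSearcher searcher;

  PeriodicRefresh(IndexReader initial) {
    reader = initial;
    searcher = new IndexSearcher(reader);
  }

  // Call every refresh interval (e.g. 20 minutes): reopen() returns the
  // same instance if nothing changed, else a reader that shares all
  // unchanged segment readers, so only new segments need any warm-up.
  void refresh() throws IOException {
    IndexReader newReader = reader.reopen();
    if (newReader != reader) {
      reader.close();
      reader = newReader;
      searcher = new IndexSearcher(reader);
    }
  }
}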
Re: Another possible optimization - now in DocIdSetIterator
On Fri, Apr 24, 2009 at 10:26:21PM +0300, Shai Erera wrote: I think we can make some optimization to DocIdSetIterator. Today, it defines next() and skipTo(int) which return a boolean. I've checked the code and it looks like almost always when these two are called, they are followed by a call to doc(). I was thinking that if those two returned the doc Id they are at, instead of boolean, that will save the call to doc(). What do you think? It'll work. Nathan Kurz proposed exactly this change for KinoSearch last July. http://rectangular.com/pipermail/kinosearch/2007-July/004149.html I think there is a small gain by having Scorer_Advance return a doc number directly rather than a boolean, obviating the need for a follow-up call to Scorer_Doc. I finished the implementation last October. One additional wrinkle, though: doc nums start at 1 rather than 0, so the return values for Next() and Advance() can double as booleans. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702526#action_12702526 ] Michael McCandless commented on LUCENE-831: --- {quote} Grandma! But yeah we need to somehow support probably plain Java objects rather than every primitive derivative? {quote} You mean big arrays (one per doc) of plain-java-objects? Is Bobo doing that today? Or do you mean a single Java object that, internally, deals with lookup by docID? {quote} (In reference to Mark's post 2nd to last post) Bobo efficiently nicely calculates facets for multiple values per doc which is the same thing as multi value faceting? {quote} Neat. How do you compactly represent (in RAM) multiple values per doc? {quote} Are norms and deletes implemented? These would just be byte arrays in the current approach? If not how would they be represented? It seems like for deleted docs we'd want the BitVector returned from a ValueSource.get type of method? {quote} The current patch doesn't do this -- but we should think about how this change could absorb norms/deleted docs, in the future. We would add a bit variant of getXXX (eg that returns BitVector, BitSet, something). {quote} Hmm... Does this mean we'd replace the current IndexReader method of performing updates on norms and deletes with this more generic update mechanism? {quote} Probably we'd still leave the sugar APIs in place, but under the hood their impls would be switched to this. bq. It would be cool to get CSF going? Most definitely!! Complete overhaul of FieldCache API/Implementation -- Key: LUCENE-831 URL: https://issues.apache.org/jira/browse/LUCENE-831 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Assignee: Mark Miller Fix For: 3.0 Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff, fieldcache-overhaul.diff, LUCENE-831-trieimpl.patch, LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch Motivation: 1) Complete overhaul of the API/implementation of FieldCache type things... a) eliminate global static map keyed on IndexReader (thus eliminating synch block between completley independent IndexReaders) b) allow more customization of cache management (ie: use expiration/replacement strategies, disk backed caches, etc) c) allow people to define custom cache data logic (ie: custom parsers, complex datatypes, etc... anything tied to a reader) d) allow people to inspect what's in a cache (list of CacheKeys) for an IndexReader so a new IndexReader can be likewise warmed. e) Lend support for smarter cache management if/when IndexReader.reopen is added (merging of cached data from subReaders). 2) Provide backwards compatibility to support existing FieldCache API with the new implementation, so there is no redundant caching as client code migrates to new API. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
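For orientation, the API being overhauled is consumed today roughly like this; the field name "price" is a made-up example, and the process-wide static cache keyed on the IndexReader is motivation 1a in the issue description above:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

class FieldCacheExample {
  // Existing API: loads (or returns the cached) per-doc int values for
  // a field, via a global static cache keyed on the reader.
  static int[] loadPrices(IndexReader reader) throws IOException {
    return FieldCache.DEFAULT.getInts(reader, "price");
  }
}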
Re: CHANGES.txt
On Fri, Apr 17, 2009 at 1:46 PM, Steven A Rowe sar...@syr.edu wrote: A few random observations about CHANGES.txt and the generated CHANGES.html: - The ü in Christian Kohlschütter's name is not proper UTF-8 (maybe it's double-encoded or something) in the two LUCENE-1186 mentions in the Trunk section, though it looks okay in the LUCENE-1186 mention in the 2.4.1 release section. (Déjà-vu all over again, non?) OK I committed this fix. Yes, definitely déjà-vu! - Five issues (LUCENE-1186, 1452, 1453, 1465 and 1544) are mentioned in both the 2.4.1 section and in the Trunk section. AFAICT, it has not been standard practice to mention bug fixes on a major or minor release (which Trunk will become) if they are mentioned on a prior patch release. Hmm -- I thought it'd be good to be clear on which bugs were fixed, where, even if it causes some redundancy? - The perl script that generates Changes.html (changes2html.pl) finds list items using a regex like /\s*\d+\.\s+/, but there is one list item in CHANGES.txt (#4 under Bug fixes in the Trunk section, for LUCENE-1453) that doesn't match this regex, since it's missing the trailing period ("4" instead of "4."), and so it's interpreted as just another paragraph in the previous list item. To fix this, either the regex should be changed, or "4" should be changed to "4.". (I prefer the latter, since this is the only occurrence, and it has never been part of a release.) I committed this fix. - The Trunk section sports use of a new feature: code sections, for the two mentions of LUCENE-1575. This looks fine in the text rendering, but looks crappy in the HTML version, since changes2html.pl escapes HTML metacharacters to appear as-is in the HTML rendering, but the newlines in the code are converted to a single space. I think this should be fixed by modifying changes2html.pl to convert <code> and </code> into (unescaped) <code><pre> and </pre></code>, respectively, since just passing through <code> and </code>, without </?pre>, while changing the font to monospaced (nice), still collapses whitespace (not nice). (There is a related question: should all HTML tags in CHANGES.txt be passed through without being escaped? I don't think so; better to handle them on a case-by-case basis, as the need arises.) Can you make a patch for <code><pre>...</pre></code>? (I like that approach). I agree let's not make it generic to all HTML tags for now... Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1610) Preserve whitespace in code sections in the Changes.html generated from CHANGES.txt by changes2html.pl
Preserve whitespace in code sections in the Changes.html generated from CHANGES.txt by changes2html.pl Key: LUCENE-1610 URL: https://issues.apache.org/jira/browse/LUCENE-1610 Project: Lucene - Java Issue Type: Improvement Components: Website Affects Versions: 2.9 Reporter: Steven Rowe Priority: Trivial Fix For: 2.9 The Trunk section of CHANGES.txt sports use of a new feature: code sections, for the two mentions of LUCENE-1575. This looks fine in the text rendering, but looks crappy in the HTML version, since changes2html.pl escapes HTML metacharacters to appear as-is in the HTML rendering, but the newlines in the code are converted to a single space. I think this should be fixed by modifying changes2html.pl to convert <code> and </code> into (unescaped) <code><pre> and </pre></code>, respectively, since just passing through <code> and </code>, without </?pre>, while changing the font to monospaced (nice), still collapses whitespace (not nice). See the java-dev thread that spawned this issue here: http://www.nabble.com/CHANGES.txt-td23102627.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Another possible optimization - now in DocIdSetIterator
On Fri, Apr 24, 2009 at 4:20 PM, Marvin Humphrey mar...@rectangular.com wrote: One additional wrinkle, though: doc nums start at 1 rather than 0, so the return values for Next() and Advance() can double as a booleans. Meaning they return 0 to indicate no more docs? Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1610) Preserve whitespace in code sections in the Changes.html generated from CHANGES.txt by changes2html.pl
[ https://issues.apache.org/jira/browse/LUCENE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe updated LUCENE-1610: Attachment: LUCENE-1610.patch Implements the suggested fix: <code> is converted to <code><pre> (instead of to &lt;code&gt;) and </code> is converted to </pre></code> (instead of to &lt;/code&gt;) Preserve whitespace in code sections in the Changes.html generated from CHANGES.txt by changes2html.pl Key: LUCENE-1610 URL: https://issues.apache.org/jira/browse/LUCENE-1610 Project: Lucene - Java Issue Type: Improvement Components: Website Affects Versions: 2.9 Reporter: Steven Rowe Priority: Trivial Fix For: 2.9 Attachments: LUCENE-1610.patch The Trunk section of CHANGES.txt sports use of a new feature: code sections, for the two mentions of LUCENE-1575. This looks fine in the text rendering, but looks crappy in the HTML version, since changes2html.pl escapes HTML metacharacters to appear as-is in the HTML rendering, but the newlines in the code are converted to a single space. I think this should be fixed by modifying changes2html.pl to convert <code> and </code> into (unescaped) <code><pre> and </pre></code>, respectively, since just passing through <code> and </code>, without </?pre>, while changing the font to monospaced (nice), still collapses whitespace (not nice). See the java-dev thread that spawned this issue here: http://www.nabble.com/CHANGES.txt-td23102627.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Another possible optimization - now in DocIdSetIterator
But I thought doc Ids start with 0? That's why I wrote 'set to -1' ... On Sat, Apr 25, 2009 at 12:18 AM, Michael McCandless luc...@mikemccandless.com wrote: On Fri, Apr 24, 2009 at 4:20 PM, Marvin Humphrey mar...@rectangular.com wrote: One additional wrinkle, though: doc nums start at 1 rather than 0, so the return values for Next() and Advance() can double as a booleans. Meaning they return 0 to indicate no more docs? Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Another possible optimization - now in DocIdSetIterator
On Fri, Apr 24, 2009 at 05:18:30PM -0400, Michael McCandless wrote: One additional wrinkle, though: doc nums start at 1 rather than 0, so the return values for Next() and Advance() can double as a booleans. Meaning they return 0 to indicate no more docs? Yes. 0 is our sentinel. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Another possible optimization - now in DocIdSetIterator
On Sat, Apr 25, 2009 at 12:24:59AM +0300, Shai Erera wrote: But I thought doc Ids start with 0? That's why I wrote 'set to -1' ... This was in the context of KinoSearch. In KS, doc numbers start at 1. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1611) Do not launch new merges if IndexWriter has hit OOME
Do not launch new merges if IndexWriter has hit OOME Key: LUCENE-1611 URL: https://issues.apache.org/jira/browse/LUCENE-1611 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4.1 Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 2.9 if IndexWriter has hit OOME, it defends itself by refusing to commit changes to the index, including merges. But this can lead to infinite merge attempts because we fail to prevent starting a merge. Spinoff from http://www.nabble.com/semi-infinite-loop-during-merging-td23036156.html. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Another possible optimization - now in DocIdSetIterator
It's a nice approach, but I think it relies on C interpreting integer 0 as false, which we can't do in Java. (And, we lack unsigned int in java so we have immense freedom to pick any negative number as our sentinel ;) ). Not to mention it'd be a scary change to make at this point! So I think we should just stick with -1 as our sentinel. Mike On Fri, Apr 24, 2009 at 5:33 PM, Marvin Humphrey mar...@rectangular.com wrote: On Fri, Apr 24, 2009 at 05:18:30PM -0400, Michael McCandless wrote: One additional wrinkle, though: doc nums start at 1 rather than 0, so the return values for Next() and Advance() can double as a booleans. Meaning they return 0 to indicate no more docs? Yes. 0 is our sentinel. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: CHANGES.txt
Hi Mike, On 4/24/2009 at 4:45 PM, Michael McCandless wrote: On Fri, Apr 17, 2009 at 1:46 PM, Steven A Rowe sar...@syr.edu wrote: - Five issues (LUCENE-1186, 1452, 1453, 1465 and 1544) are mentioned in both the 2.4.1 section and in the Trunk section. AFAICT, it has not been standard practice to mention bug fixes on a major or minor release (which Trunk will become) if they are mentioned on a prior patch release. Hmm -- I thought it'd be good to be clear on which bugs were fixed, where, even if it causes some redundancy? Right: SUM(+1 clarity, -0.5 redundancy) = +0.5 :) So the policy you're suggesting is: When backporting bug fixes from trunk to a patch version, make note of the change in both the trunk and patch version sections of CHANGES.txt, right? Makes sense (though as I noted, this policy has never before been used), but why then did you include only 5 out of the 15 bug fixes listed under 2.4.1 in the Trunk section? - The Trunk section sports use of a new feature: code sections, for the two mentions of LUCENE-1575. This looks fine in the text rendering, but looks crappy in the HTML version, since changes2html.pl escapes HTML metacharacters to appear as-is in the HTML rendering, but the newlines in the code are converted to a single space. I think this should be fixed by modifying changes2html.pl to convert <code> and </code> into (unescaped) <code><pre> and </pre></code>, respectively, since just passing through <code> and </code>, without </?pre>, while changing the font to monospaced (nice), still collapses whitespace (not nice). (There is a related question: should all HTML tags in CHANGES.txt be passed through without being escaped? I don't think so; better to handle them on a case-by-case basis, as the need arises.) Can you make a patch for <code><pre>...</pre></code>? (I like that approach). I agree let's not make it generic to all HTML tags for now... Done: https://issues.apache.org/jira/browse/LUCENE-1610 Steve - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702573#action_12702573 ] Jason Rutherglen commented on LUCENE-1313: -- {quote} I think we could adopt a simple criterion: you flush the new segment to the RAM Dir if net RAM used is < maxRamBufferSizeMB. This way no further configuration is needed. On auto-flush triggering you then must take into account the RAM usage by this RAM Dir. {quote} So we're ok with the blocking that occurs when the ram buffer is flushed to the ramdir? {quote}On commit, these RAM segments must be migrated to the real dir (preferably by forcing a merge, somehow). {quote} This is pretty much like resolveExternalSegments which would be called in prepareCommit? This could make calls to commit much more time consuming. It may be confusing to the user why IW.flush doesn't copy the ram segments to disk. {quote}A near realtime reader would also happily mix real Dir and RAMDir SegmentReaders.{quote} Agreed, however the IW.getReader MultiSegmentReader removes readers from another directory so we'd need to add a new attribute to segmentinfo that marks it as ok for inclusion in the MSR? Realtime Search --- Key: LUCENE-1313 URL: https://issues.apache.org/jira/browse/LUCENE-1313 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch Realtime search with transactional semantics. Possible future directions: * Optimistic concurrency * Replication Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot. I think this issue can hold realtime benchmarks which include indexing and searching concurrently. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Another possible optimization - now in DocIdSetIterator
Hi Shai, absolutely! We have been there, and there are already some micro benchmarks done in LUCENE-1345. Just do not forget to use -1 < doc instead of -1 != doc; trust me, Yonik convinced me :) As a side effect, this change would have some positive effects on iterator semantics; it prevents very-hard-to-find one-off bugs caused by calling doc() before calling next(). We had quite a few of those. -1 is good, as it supports moving to the first doc in next() without an if(initialized) check, just incrementing. cheers, eks From: Shai Erera ser...@gmail.com To: java-dev@lucene.apache.org Sent: Friday, 24 April, 2009 21:26:21 Subject: Another possible optimization - now in DocIdSetIterator Hi I think we can make some optimization to DocIdSetIterator. Today, it defines next() and skipTo(int) which return a boolean. I've checked the code and it looks like almost always when these two are called, they are followed by a call to doc(). I was thinking that if those two returned the doc Id they are at, instead of boolean, that will save the call to doc(). Those that use these can: * Compare doc to a NO_MORE_DOCS constant (set to -1), to understand there are no more docs in this iterator. * If skipTo() is called, compare the 'target' to the returned Id, and if they are not the same, save it so that the next skipTo is requested, they don't perform it if the returned Id is greater than the target. If it's not possible to save it, they can call doc() to get that information. The way I see it, the impls that will still need to call doc() will lose nothing. All we'll do is change the 'if' from comparing a boolean to comparing ints (even though that's a bit less efficient than comparing booleans). The impls that call doc() just because all they have in hand is a boolean, will gain. Obviously we can't change those methods' signature, so we can deprecate them and intorudce nextDoc() and skipToDoc(int target). We should still keep doc() around though. What do you think? If you agree to this change, I can add them to 1593, or create a new issue (I prefer the latter so that 1593 will be focused on the changes to Collectors). Shai
[jira] Updated: (LUCENE-1611) Do not launch new merges if IndexWriter has hit OOME
[ https://issues.apache.org/jira/browse/LUCENE-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1611:
---
Attachment: LUCENE-1611.patch

Attached patch to prevent starting new merges after OOME, and to throw IllegalStateException in optimize and expungeDeletes if OOME has been hit. I plan to commit in a day or two.

Do not launch new merges if IndexWriter has hit OOME
Key: LUCENE-1611
URL: https://issues.apache.org/jira/browse/LUCENE-1611
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: 2.4.1
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 2.9
Attachments: LUCENE-1611.patch

If IndexWriter has hit OOME, it defends itself by refusing to commit changes to the index, including merges. But this can lead to infinite merge attempts, because we fail to prevent starting a merge. Spinoff from http://www.nabble.com/semi-infinite-loop-during-merging-td23036156.html.
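The shape of the guard is roughly the following sketch (the field and method names here are illustrative, not necessarily those of the actual patch):

    import java.io.IOException;

    class OomeGuardedWriter {
      private volatile boolean hitOOM;

      void onOutOfMemory(OutOfMemoryError oom) {
        hitOOM = true; // remember the OOME; new merges are refused from now on
        throw oom;
      }

      void maybeStartNewMerge() {
        if (hitOOM) {
          return; // never launch a new merge after OOME; avoids the infinite retry loop
        }
        // ... select and launch a merge ...
      }

      void optimize() throws IOException {
        if (hitOOM) {
          throw new IllegalStateException(
              "this writer hit an OutOfMemoryError; cannot optimize");
        }
        // ... normal optimize path ...
      }
    }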
[jira] Resolved: (LUCENE-1610) Preserve whitespace in code sections in the Changes.html generated from CHANGES.txt by changes2html.pl
[ https://issues.apache.org/jira/browse/LUCENE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1610.
----
Resolution: Fixed

Thanks Steve!

Preserve whitespace in code sections in the Changes.html generated from CHANGES.txt by changes2html.pl
Key: LUCENE-1610
URL: https://issues.apache.org/jira/browse/LUCENE-1610
Project: Lucene - Java
Issue Type: Improvement
Components: Website
Affects Versions: 2.9
Reporter: Steven Rowe
Priority: Trivial
Fix For: 2.9
Attachments: LUCENE-1610.patch

The Trunk section of CHANGES.txt sports use of a new feature: code sections, for the two mentions of LUCENE-1575. This looks fine in the text rendering, but looks crappy in the HTML version, since changes2html.pl escapes HTML metacharacters to appear as-is in the HTML rendering, but the newlines in the code are converted to a single space. I think this should be fixed by modifying changes2html.pl to convert <code> and </code> into (unescaped) <code><pre> and </pre></code>, respectively, since just passing through <code> and </code> without <pre>/</pre>, while changing the font to monospaced (nice), still collapses whitespace (not nice). See the java-dev thread that spawned this issue here: http://www.nabble.com/CHANGES.txt-td23102627.html
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702579#action_12702579 ] Michael McCandless commented on LUCENE-1313:

{quote} So we're ok with the blocking that occurs when the ram buffer is flushed to the ramdir? {quote}

Well... we don't have a choice (unless/until we implement an IndexReader impl to directly search the RAM buffer). Still, this should be a good improvement over the blocking when flushing to a real dir.

{quote} This is pretty much like resolveExternalSegments which would be called in prepareCommit? This could make calls to commit much more time consuming. It may be confusing to the user why IW.flush doesn't copy the ram segments to disk. {quote}

Similar... the difference is I'd prefer to do a merge of the RAM segments vs the straight one-for-one copy that resolveExternalSegments does. Commit would only become more time consuming in the NRT case? IE we'd only flush-to-RAMdir if it's getReader that's forcing the flush? In which case, I think it's fine that commit gets more costly. Also, I wouldn't expect it to be much more costly: we are doing an in-memory merge of N segments, writing one segment to the real directory, vs writing each tiny segment as a real one. In fact, commit could get cheaper (when compared to not making this change), since there are fewer new files to fsync.

{quote} Agreed, however the IW.getReader MultiSegmentReader removes readers from another directory so we'd need to add a new attribute to segmentinfo that marks it as ok for inclusion in the MSR? {quote}

Or, fix that filtering to also accept IndexWriter's RAMDir.
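At the application level, the flavor of what is being discussed (small segments living in RAM, then merged down to the real directory in one step at commit time) can be sketched with the existing public API; this is only an illustration of the idea, not the internal IndexWriter change itself, and the index path is hypothetical:

    import java.io.IOException;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamThenMerge {
      public static void main(String[] args) throws IOException {
        Directory realDir = FSDirectory.getDirectory("/tmp/index"); // hypothetical path
        RAMDirectory ramDir = new RAMDirectory();

        // Small segments accumulate cheaply in RAM...
        IndexWriter ramWriter = new IndexWriter(ramDir, new WhitespaceAnalyzer(),
            IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("f", "hello", Field.Store.NO, Field.Index.ANALYZED));
        ramWriter.addDocument(doc);
        ramWriter.close();

        // ...and "commit" merges them into the on-disk index in one step:
        // fewer new files reach the real directory, so fewer files to fsync.
        IndexWriter realWriter = new IndexWriter(realDir, new WhitespaceAnalyzer(),
            IndexWriter.MaxFieldLength.UNLIMITED);
        realWriter.addIndexes(new Directory[] { ramDir });
        realWriter.close();
      }
    }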
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702596#action_12702596 ] Jason Rutherglen commented on LUCENE-1313:
--

I'm confused as to how we make DocumentsWriter switch from writing to disk vs the ramdir? It seems like a fairly major change to the system? One that's hard to switch later on, after IW is instantiated? Perhaps the IW.addWriter method is easier in this regard?

{quote} the difference is I'd prefer to do a merge of the RAM segments vs the straight one-for-one copy that resolveExternalSegments does. {quote}

Yeah, I implemented it this way in the IW.addWriter code. I agree it's better for IW.commit to copy all the ramdir segments to one disk segment.

I started working on IW.addWriter(IndexWriter, boolean removeFrom), where removeFrom removes the segments that have been copied to the destination writer from the source writer. This method gets around the issue of blocking, because potentially several writers could concurrently be copied to the destination writer. The only issue at this point is how the destination writer obtains segment readers from source readers when they're in the other writers' pool? Maybe the SegmentInfo can have a reference to the writer it originated in? That way we can easily access the right reader pool when we need it?
RE: CHANGES.txt
On 4/24/2009 at 6:24 PM, Michael McCandless wrote:

On Fri, Apr 24, 2009 at 5:44 PM, Steven A Rowe sar...@syr.edu wrote:

On 4/24/2009 at 4:45 PM, Michael McCandless wrote:

On Fri, Apr 17, 2009 at 1:46 PM, Steven A Rowe sar...@syr.edu wrote:

- Five issues (LUCENE-1186, 1452, 1453, 1465 and 1544) are mentioned in both the 2.4.1 section and in the Trunk section. AFAICT, it has not been standard practice to mention bug fixes on a major or minor release (which Trunk will become) if they are mentioned on a prior patch release.

Hmm -- I thought it'd be good to be clear on which bugs were fixed, where, even if it causes some redundancy? [...]

So the policy you're suggesting is: when backporting bug fixes from trunk to a patch version, make note of the change in both the trunk and patch version sections of CHANGES.txt, right? Makes sense (though as I noted, this policy has never before been used). Hmmm. But why then did you include only 5 out of the 15 bug fixes listed under 2.4.1 in the Trunk section?

Yeah, good point... let me better describe what I've been doing, and then we can separately decide if it's good or not! For tiny bug fixes, e.g. LUCENE-1429 or LUCENE-1474, I often don't include a CHANGES entry in trunk, because I want to keep the signal-to-noise ratio higher at that point for eventual users upgrading to the next major release. But then when I backport anything to a point release, I try very hard to include an entry in CHANGES for every little change, on the thinking that people considering a point-release upgrade really want to know every single change (to properly assess risk/benefit). When I release a point release, I then carry its entries back to the trunk's CHANGES, and so then we see some issues listed only under 2.4.1, which is bad since it could make people think they were in fact not fixed on trunk. So what to do? Maybe even tiny bug fixes should always be called out in trunk's CHANGES. Or, maybe a tiny bug fix that also gets backported to a point release must then be called out in both places? I think I prefer the 2nd.

The difference between these two options is that in the 2nd, tiny bug fixes are mentioned in trunk's CHANGES only if they are backported to a point release, right? For the record, the previous policy (the zeroth option :) appears to be that backported bug fixes, regardless of size, are mentioned only once, in the CHANGES for the (chronologically) first release in which they appeared. You appear to oppose this policy, because (paraphrasing): people would wonder whether point-release fixes were also fixed in following major/minor releases. IMNSHO, however, people (sometimes erroneously) view product releases as genetically linear: naming a release A.(B)[.x] implies inclusion of all changes to any release A.B[.y]. I.e., my sense is quite the opposite of yours: I would be *shocked* if bug fixes included in version 2.4.1 were not included (or explicitly called out as not included) in version 2.9.0. If more than one point-release branch is active at any one time, then things get more complicated (genetic linearity can no longer be assumed), and your new policy seems like a reasonable attempt at managing the complexity. But will Lucene ever have more than one active bugfix branch? It never has before. But maybe I'm not understanding your intent: are you distinguishing between released CHANGES and unreleased CHANGES?
That is, do you intend to apply this new policy only to the unreleased trunk CHANGES, but then remove the redundant bugfix notices once a release is performed?

Steve
Re: Another possible optimization - now in DocIdSetIterator
On Fri, Apr 24, 2009 at 6:20 PM, eks dev eks...@yahoo.co.uk wrote:

just do not forget to use -1 < doc instead of -1 != doc

Perhaps doc >= 0 instead of doc != -1? The crux of it is that status flags (result positive, negative, or zero) are set by many operations - hence a compare/test operation can often be eliminated. For this same reason, counting down to zero in a loop, instead of counting up to a limit, can be slightly faster. It's a single cycle though, normally not much to worry about :-) Of course, now that we have processors like the i7 with macro-ops fusion, which can take a TEST/CMP followed by a branch and fuse it into a single operation, the field may be leveled again.

-Yonik
http://www.lucidimagination.com
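A tiny illustration of the counting-down idiom Yonik mentions (a hypothetical loop, just to show the shape):

    int[] values = { 3, 1, 4, 1, 5 };
    long sum = 0;

    // Counting up: each pass pays a separate compare against values.length.
    for (int i = 0; i < values.length; i++) {
      sum += values[i];
    }

    // Counting down to zero: the decrement already sets the status flags,
    // so the explicit compare can be elided. Iteration order reverses,
    // which is fine for order-independent work like summing.
    for (int i = values.length; --i >= 0;) {
      sum += values[i];
    }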
[jira] Created: (LUCENE-1612) expose lastDocId in the posting from the TermEnum API
expose lastDocId in the posting from the TermEnum API
Key: LUCENE-1612
URL: https://issues.apache.org/jira/browse/LUCENE-1612
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 2.4
Reporter: John Wang

We currently have docFreq() on the TermEnum api, which gives the number of docs in the posting. It would be good to also have the max docid in the posting. That information is useful when constructing a custom DocIdSet, e.g. to determine the sparseness of the doc list in order to decide whether or not to use a BitSet.

I have written a patch to do this. The problem with it is that TermInfosWriter encodes values as VInt/VLong, so there is very little flexibility to add in lastDocId while keeping the index backward compatible. (If a simple int were used for, say, docFreq, a bit could be used to flag reading of a new piece of information.)

    output.writeVInt(ti.docFreq); // write doc freq
    output.writeVLong(ti.freqPointer - lastTi.freqPointer); // write pointers
    output.writeVLong(ti.proxPointer - lastTi.proxPointer);

Anyway, a patch is attached, with TestSegmentTermEnum modified to test this. TestBackwardsCompatibility fails for the reasons described above.
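The flag-bit idea mentioned above only works if docFreq were a fixed-width int (today it is a VInt, which is exactly the problem). A hypothetical sketch of what it would look like, not the attached patch:

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;
    import org.apache.lucene.store.IndexOutput;

    // Hypothetical: steal the unused sign bit of a fixed-width docFreq to
    // signal that a new lastDocId field follows it in the term dictionary.
    class LastDocIdSketch {
      static final int HAS_LAST_DOC_ID = 1 << 31;

      void write(IndexOutput out, int docFreq, int lastDocId) throws IOException {
        out.writeInt(docFreq | HAS_LAST_DOC_ID); // flag: extra field present
        out.writeVInt(lastDocId);
      }

      int[] read(IndexInput in) throws IOException {
        int raw = in.readInt();
        int docFreq = raw & ~HAS_LAST_DOC_ID;
        // Old indexes never set the bit, so they remain readable.
        int lastDocId = (raw & HAS_LAST_DOC_ID) != 0 ? in.readVInt() : -1;
        return new int[] { docFreq, lastDocId };
      }
    }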
[jira] Updated: (LUCENE-1612) expose lastDocId in the posting from the TermEnum API
[ https://issues.apache.org/jira/browse/LUCENE-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Wang updated LUCENE-1612:
--
Attachment: lucene-1612-patch.txt

Patch attached with test. Index is not backwards compatible.
[jira] Created: (LUCENE-1613) TermEnum.docFreq() is not updated when there are deletes
TermEnum.docFreq() is not updated when there are deletes
Key: LUCENE-1613
URL: https://issues.apache.org/jira/browse/LUCENE-1613
Project: Lucene - Java
Issue Type: Bug
Components: Search
Affects Versions: 2.4
Reporter: John Wang

TermEnum.docFreq is used in many places, especially scoring. However, if there are deletes in the index and they are not yet merged away, this value is not updated. Attached is a test case.
[jira] Updated: (LUCENE-1613) TermEnum.docFreq() is not updated when there are deletes
[ https://issues.apache.org/jira/browse/LUCENE-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Wang updated LUCENE-1613:
--
Attachment: TestDeleteAndDocFreq.java

Test showing docFreq not updated when there are deletes.
[jira] Commented: (LUCENE-1613) TermEnum.docFreq() is not updated when there are deletes
[ https://issues.apache.org/jira/browse/LUCENE-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702649#action_12702649 ] John Wang commented on LUCENE-1613:
---

I understand this is a rather difficult problem to fix. I thought keeping a jira ticket would still be good for tracking purposes. I will let the committers decide on the urgency of this issue.
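The behavior is easy to reproduce; a minimal sketch along the lines of the attached test (the class and field names here are made up, not those of TestDeleteAndDocFreq.java):

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.RAMDirectory;

    public class DocFreqAfterDelete {
      public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter w = new IndexWriter(dir, new WhitespaceAnalyzer(),
            IndexWriter.MaxFieldLength.UNLIMITED);
        for (int i = 0; i < 2; i++) {
          Document doc = new Document();
          doc.add(new Field("f", "x", Field.Store.NO, Field.Index.NOT_ANALYZED));
          w.addDocument(doc);
        }
        w.close();

        IndexReader r = IndexReader.open(dir);
        r.deleteDocument(0);
        // Still prints 2, not 1: the delete is not folded into docFreq
        // until the segment is merged.
        System.out.println(r.docFreq(new Term("f", "x")));
        r.close();
      }
    }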
perf enhancement and lucene-1345
Hi guys, a while ago I posted to LUCENE-1345 some enhancements to the disjunction and conjunction DocIdSetIterators that showed performance improvements. I think it got mixed up with another discussion on that issue. I was wondering what happened with it and what the plans are.

Thanks

-John
[jira] Created: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
Key: LUCENE-1614
URL: https://issues.apache.org/jira/browse/LUCENE-1614
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Fix For: 2.9

See http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html for the full discussion. The basic idea is to add variants of those two methods that return the current doc they are at, to save successive calls to doc(). If there are no more docs, return -1. A summary of what was discussed so far:
# Deprecate those two methods.
# Add nextDoc() and skipToDoc(int) that return the doc, with default impls in DISI (calling next() and skipTo() respectively; to be changed to abstract in 3.0).
#* I actually would like to propose an alternative to the names: advance() and advance(int) - the first advances by one, the second advances to target.
# Wherever these are used, do something like '(doc = advance()) >= 0' instead of comparing to -1, for improved performance.

I will post a patch shortly.
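For reference, the back-compat default impls described in item 2 would presumably look something like this inside DocIdSetIterator (a sketch; the eventual patch may differ):

    /** Sketch: the new method delegates to the deprecated boolean one. */
    public int nextDoc() throws IOException {
      return next() ? doc() : -1; // -1 == no more docs, per the proposal
    }

    /** Sketch: likewise for the targeted variant. */
    public int skipToDoc(int target) throws IOException {
      return skipTo(target) ? doc() : -1;
    }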
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702651#action_12702651 ] Shai Erera commented on LUCENE-1593:
---

bq. Same for BS2.score() and score(HC) - initCountingSumScorer?

I proposed this change, but it is problematic. initCountingSumScorer declares that it throws IOE, but none of the ctors, nor add(), declares it. So I cannot add a call to initCountingSumScorer without breaking back-compat, unless:
* I wrap IOE with RE, but I don't like that very much.
* We close our eyes, saying it's not very likely that BS2 is used on its own, outside BooleanQuery (or BooleanWeight, for that matter). I did a short test and added IOE; nothing breaks (at least on the 'java' side - haven't checked contrib), and I'm pretty confident other code will not break either, since the rest of the methods do throw IOE (like score(), next(), skipTo()) - and why would you want to initialize BS2 if not to call these?

What do you think about the 2nd option?

Optimizations to TopScoreDocCollector and TopFieldCollector
---
Key: LUCENE-1593
URL: https://issues.apache.org/jira/browse/LUCENE-1593
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Fix For: 2.9

This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is:
# Ensure that IndexSearcher returns segments in increasing doc Id order, instead of numDocs().
# Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete.
# Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check if reusableSD == null.
# Also move to use changing top and then calling adjustTop(), in case we update the queue.
# Some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker for the last SortField. But doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in).
# Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add an addDummyObjects method which will populate the queue without arranging it, just store the objects in the array (this can be used to pre-populate sentinel values)?

I will post a patch as well as some perf measurements as soon as I have them.
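For reference, the first option (wrapping IOE with RE) would look something like this sketch inside BS2; the helper name is hypothetical and purely illustrative:

    // Sketch: call initCountingSumScorer eagerly but wrap the IOException,
    // since the existing ctors/add() cannot declare it.
    private void initFromCtor() { // hypothetical helper name
      try {
        initCountingSumScorer(); // declares IOException
      } catch (IOException ioe) {
        throw new RuntimeException(ioe); // preserves the existing signatures
      }
    }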