Re: Lucene's default settings & back compatibility
> > Your example confused me.

You're right. I wrote it with one eye closed already. I meant to say that if I'm a 2.4 user and something gets deprecated in trunk (afterwards), it is carried through 2.4.X and 2.5 and then removed in 2.6. So only 1 full minor release.

> It's somewhat crazy, but what if we deprecate stuff and rename it?

I absolutely love that idea! But it means that:

1) We cannot support jar drop-in ability in those cases (which I'm fine with, because people can upgrade to 2.4.X to get bug fixes), not just because the API does something different, but because it may not compile. For example, the changes I'm doing in 1614 would have changed the next() and skipTo() signatures, so someone who wrote a DISI with a next() that returns boolean will fail to compile.
2) We give the deprecated API the mediocre names. (A funny thought: we can give those methods/classes really stupid/nasty names, to emphasize the beauty of the existing API, to encourage people to stick with the better API :) ).
3) We document clearly what needs to be done in order to use the deprecated API.

One thing we didn't fully address here is methods added to interfaces/abstract classes. When we add a method to an abstract class with a default impl, that's OK. But what if we need to make it abstract (like we had to do in 1575 for the Collector versions)? I guess for interfaces we should first move all of them to abstract classes. I like interfaces, but abstract classes give us slightly more freedom when we face back-compat issues. Maybe to support Earwin's idea, we use the name for a new abstract class, and give the interface a different name? That way, to upgrade, people just need to change implements to extends (I hope that won't cause any problems if their classes already extend something else). But if we apply this policy to interfaces, I think more users will need to touch their code when upgrading even minor releases.

So Mike, about actsAsVersion ... I think I'm starting to get used to it. I do relate to what Marvin writes though, about two different apps running in the same JVM with different settings. We have such a case: two teams develop two search solutions (for two back-ends). They live in the same JVM but have different development plans/schedules. So it's not just a hypothetical problem to me. If we could have the app say something like Version.getInstance(appId).actAsVersion(2.4), that would solve it, because each app will have its own Id, and the Version class would maintain a map between the Id and an instance. But I've yet to resolve (in my mind) how the Lucene code will use it, since the same code runs in two apps with different IDs, and so won't know which appId to pass. Oh well .. we're going to change the way those two teams work anyway, so for me at least, this problem will be gone soon :)

I also agree that actsAsVersion breaks the locality principle, by which when you see a bug you should be able to look in the surroundings of where the bug happened, and not discover that the bug stems from files away. But I don't like passing version information in the constructors either ...

What if we continue with Marvin's proposal of saving that information in the index? I think, Mike, that I asked you a similar question a while ago, about whether Lucene has the ability to store index versions. Index versions are important and can save some of the problems here - not just with storing the stopwords list, but also with code that manipulates the index, or makes decisions about scoring etc. For the two apps in the same JVM it should solve the problem, since I think we can safely assume each operates on its own index.

Arggh .. but again we face the same problem - how do we pass that information to the different classes? How is a TokenStream expected to read that info? I think we may have to settle on the static Version class, even if it reads the information from the index (by doing some Version.init(File indexDir)).

Shai

On Fri, May 22, 2009 at 1:53 AM, Marvin Humphrey wrote:

> On Thu, May 21, 2009 at 05:19:43PM -0400, Michael McCandless wrote:
>
> > Marvin, which solution would you prefer?
>
> Between the two, I'd prefer settings constructor arguments, though I would be inclined to have settings classes that are specific to individual classes rather than Lucene-wide.
>
> At least that scheme gets locality right. The global actsAsVersion variable violates that principle and has the potential to saddle a small number of users who have done absolutely nothing wrong with bugs that are very, very hard to hunt down. That's unfair.
>
> As far as analyzers and token streams, the theoretical answer is making indexes self-describing via serializable schemas, as discussed on the Lucy dev list, and as implemented in KinoSearch svn trunk. With versioning metadata attached to the index, there is no longer any worry about upgrading analysis modules provided that those modules handle their own versioning correctly.
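For readers following the thread, here is a minimal Java sketch of the per-application registry idea (Version.getInstance(appId).actAsVersion(2.4)) from the message above. All names here (AppVersion, actsAs(), the "latest" default, the team ids) are hypothetical and chosen for illustration; nothing like this is claimed to exist in Lucene. The version is kept as a String to sidestep floating-point comparisons.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-application version holder; NOT a Lucene class. */
final class AppVersion {
    // One instance per application id, shared by everything in the JVM.
    private static final Map<String, AppVersion> INSTANCES = new ConcurrentHashMap<>();

    // Assumed default for apps that never call actAsVersion().
    private volatile String actsAs = "latest";

    private AppVersion() {}

    /** Returns the instance for this app id, creating it on first use. */
    static AppVersion getInstance(String appId) {
        return INSTANCES.computeIfAbsent(appId, id -> new AppVersion());
    }

    /** Records which release's behaviour this application wants. */
    AppVersion actAsVersion(String version) {
        this.actsAs = version;
        return this;
    }

    String actsAs() {
        return actsAs;
    }
}

class AppVersionDemo {
    public static void main(String[] args) {
        // Two teams in one JVM, each pinning its own behaviour.
        AppVersion.getInstance("team-a").actAsVersion("2.4");
        AppVersion.getInstance("team-b").actAsVersion("2.9");
        System.out.println(AppVersion.getInstance("team-a").actsAs());
        System.out.println(AppVersion.getInstance("team-b").actsAs());
    }
}
```

The sketch also makes the open question visible: any library code that wants to consult the setting still has to be handed an app id from somewhere, which is exactly the part the message leaves unresolved.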
[jira] Commented: (LUCENE-1636) TokenFilters with a null value in the constructor fail
[ https://issues.apache.org/jira/browse/LUCENE-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711933#action_12711933 ]

Wouter Heijke commented on LUCENE-1636:
---------------------------------------

I'm on holiday now, but as far as I recollect (I was not the author of the code) it was done on some filters that would be used in another situation (similar to a filter) in order to use the filter's functionality. It was also used with filters that could not be extended, so a new filter was created; here too the original filter's public methods would be called. In a way it doesn't matter; it could be done with the API without any problems in the latest few releases that I know of.

A more elegant way, if one would like to introduce this new behaviour, is to at least log some kind of error message in the 2.9 release, so users would be alerted that they use the Lucene API in a way that is not supported anymore.

> TokenFilters with a null value in the constructor fail
> ------------------------------------------------------
>
>                 Key: LUCENE-1636
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1636
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.9
>            Reporter: Wouter Heijke
>            Assignee: Michael McCandless
>             Fix For: 2.9
>
>         Attachments: LUCENE-1636.patch
>
>
> While migrating from 2.4.x to 2.9-dev I found a lot of failing unittests.
> One problem is with TokenFilters that do a super(null) in the constructor.
> I fixed it by changing the constructor to super(new EmptyTokenStream())
> This will cause problems and frustration to others while migrating to 2.9.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
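The migration fix described in the issue (replacing super(null) with super(new EmptyTokenStream())) can be sketched with stand-in classes. SketchTokenStream, SketchTokenFilter, and NormalizingHelperFilter below are simplified, hypothetical stand-ins and not the Lucene classes, and EmptyTokenStream is modelled here only as "a stream that yields nothing". The null check in the constructor is an assumption made for the sketch; the issue does not say where exactly the 2.9-dev code fails.

```java
import java.util.Locale;
import java.util.Objects;

/** Simplified stand-in for a token stream; NOT the Lucene class. */
abstract class SketchTokenStream {
    /** Returns the next token, or null when the stream is exhausted. */
    abstract String next();
}

/** A stream that never produces tokens, usable as a placeholder delegate. */
final class EmptyTokenStream extends SketchTokenStream {
    @Override
    String next() {
        return null;
    }
}

/** Stand-in filter that rejects a null delegate (assumed behaviour). */
abstract class SketchTokenFilter extends SketchTokenStream {
    protected final SketchTokenStream input;

    protected SketchTokenFilter(SketchTokenStream input) {
        this.input = Objects.requireNonNull(input, "input");
    }
}

/** A filter that is only used for its helper method, without a real delegate. */
final class NormalizingHelperFilter extends SketchTokenFilter {
    NormalizingHelperFilter() {
        super(new EmptyTokenStream()); // previously: super(null)
    }

    @Override
    String next() {
        String token = input.next();
        return token == null ? null : normalize(token);
    }

    /** The functionality callers actually wanted to reuse. */
    String normalize(String s) {
        return s.toLowerCase(Locale.ROOT);
    }
}
```

With the placeholder delegate, the helper remains usable on its own and next() simply reports an exhausted stream.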
[jira] Issue Comment Edited: (LUCENE-1636) TokenFilters with a null value in the constructor fail
[ https://issues.apache.org/jira/browse/LUCENE-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711920#action_12711920 ]

Uwe Schindler edited comment on LUCENE-1636 at 5/21/09 6:53 PM:

Mike: Would this affect backwards compatibility? If we make it final and nobody changes the stream, everything is OK. Is this also the case when using lucene.jar as a drop-in replacement without recompilation? Will changing a final variable from code compiled before finalization be detected by the JVM? Is the compiled code with final still binary compatible with code compiled against non-final members? I think we should try this out before committing!

was (Author: thetaphi):

Mike: Would this affect backwards compatibility? If we make it final and nobody changes the stream, everything is OK. Is this also the case when using plugin-replacement without recompilation? Will changing a final variable from code compiled before finalization be detected by the JVM? Is the compiled code with final still binary compatible with code compiled against non-final members? I think we should try this out before committing!
[jira] Commented: (LUCENE-1636) TokenFilters with a null value in the constructor fail
[ https://issues.apache.org/jira/browse/LUCENE-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711919#action_12711919 ]

Uwe Schindler commented on LUCENE-1636:

Hi Wouter,

I still want to find out what you are trying to do with a TokenFilter without a delegate! Can you explain why you want to initialize with super(null)? If it is because you want to change it later to something non-null, that will not work anymore (this is why I want to make the delegate stream final). So please explain!
[jira] Commented: (LUCENE-1636) TokenFilters with a null value in the constructor fail
[ https://issues.apache.org/jira/browse/LUCENE-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711915#action_12711915 ]

Wouter Heijke commented on LUCENE-1636:

I only hope users will understand this and realize that 2.9 is not backwards compatible with previous versions! This code in our codebase was added by knowledgeable Lucene developers!
[jira] Issue Comment Edited: (LUCENE-1474) Incorrect SegmentInfo.delCount when IndexReader.flush() is used
[ https://issues.apache.org/jira/browse/LUCENE-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711868#action_12711868 ]

Erik van Zijst edited comment on LUCENE-1474 at 5/21/09 4:47 PM:

I have attached the output of CheckIndex on all our index directories, which seems to report quite a few errors:

erik:cache ervzijst$ grep "CorruptIndexException\|AssertionError" CheckIndex.txt
java.lang.AssertionError: delete count mismatch: info=1263 vs BitVector=1262
java.lang.AssertionError: delete count mismatch: info=496 vs BitVector=493
java.lang.AssertionError: delete count mismatch: info=101 vs BitVector=100
java.lang.AssertionError: delete count mismatch: info=300 vs BitVector=298
java.lang.AssertionError: delete count mismatch: info=109 vs BitVector=108
java.lang.AssertionError: delete count mismatch: info=140 vs BitVector=139
java.lang.AssertionError: delete count mismatch: info=122 vs BitVector=121
java.lang.AssertionError: delete count mismatch: info=91 vs BitVector=89
java.lang.AssertionError: delete count mismatch: info=1411 vs BitVector=1409
java.lang.AssertionError: delete count mismatch: info=801 vs BitVector=800
java.lang.AssertionError: delete count mismatch: info=630 vs BitVector=629
java.lang.AssertionError: delete count mismatch: info=510 vs BitVector=508
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _0: fieldsReader shows 12365 but segmentInfo shows 12232
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _1: fieldsReader shows 10144 but segmentInfo shows 8766
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _2: fieldsReader shows 4616 but segmentInfo shows 7006
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _3: fieldsReader shows 6681 but segmentInfo shows 4854
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _4: fieldsReader shows 2652 but segmentInfo shows 8808
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _5: fieldsReader shows 11500 but segmentInfo shows 14551
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _6: fieldsReader shows 16225 but segmentInfo shows 4375
erik:cache ervzijst$

P.S. The full log is attached as CheckIndex.txt
[jira] Updated: (LUCENE-1474) Incorrect SegmentInfo.delCount when IndexReader.flush() is used
[ https://issues.apache.org/jira/browse/LUCENE-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik van Zijst updated LUCENE-1474:
-----------------------------------

    Attachment: CheckIndex.txt

> Incorrect SegmentInfo.delCount when IndexReader.flush() is used
> ---------------------------------------------------------------
>
>                 Key: LUCENE-1474
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1474
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Marcel Reutegger
>            Assignee: Michael McCandless
>             Fix For: 2.4.1, 2.9
>
>         Attachments: CheckIndex.txt, IndexReaderTest.java
>
>
> When deleted documents are flushed using IndexReader.flush() the delCount in
> SegmentInfo is updated based on the current value and
> SegmentReader.pendingDeleteCount (introduced by LUCENE-1267). It seems that
> pendingDeleteCount is not reset after the commit, which means after a second
> flush() or close() of an index reader the delCount in SegmentInfo is
> incorrect. A subsequent IndexReader.open() call will fail with an error when
> assertions are enabled. E.g.:
> java.lang.AssertionError: delete count mismatch: info=3 vs BitVector=2
> at org.apache.lucene.index.SegmentReader.loadDeletedDocs(SegmentReader.java:405)
> [...]
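The double counting reported in the description can be illustrated with a toy model in Java. ToySegment is a hypothetical stand-in, not Lucene's SegmentInfo or SegmentReader; the field names only mirror the description, and the sketch assumes that flush() adds the pending delete count to the stored delete count. A BitSet plays the role of the BitVector of deleted documents.

```java
import java.util.BitSet;

/** Toy model of the reported behaviour; NOT Lucene's SegmentInfo/SegmentReader. */
final class ToySegment {
    private final BitSet deleted = new BitSet(); // stands in for the BitVector
    private int delCount;                        // stands in for SegmentInfo.delCount
    private int pendingDeleteCount;              // deletes since the reader was opened
    private final boolean resetAfterFlush;       // true = fixed variant, false = reported bug

    ToySegment(boolean resetAfterFlush) {
        this.resetAfterFlush = resetAfterFlush;
    }

    void delete(int doc) {
        if (!deleted.get(doc)) {
            deleted.set(doc);
            pendingDeleteCount++;
        }
    }

    /** Folds pending deletes into the stored count (assumed update rule). */
    void flush() {
        delCount += pendingDeleteCount;
        if (resetAfterFlush) {
            pendingDeleteCount = 0; // without this, the next flush counts old deletes again
        }
    }

    int delCount() {
        return delCount;
    }

    int bitVectorCount() {
        return deleted.cardinality();
    }
}

class ToySegmentDemo {
    public static void main(String[] args) {
        ToySegment buggy = new ToySegment(false);
        buggy.delete(1);
        buggy.flush();
        buggy.delete(2);
        buggy.flush();
        // Prints "info=3 vs BitVector=2", the same mismatch as in the description.
        System.out.println("info=" + buggy.delCount() + " vs BitVector=" + buggy.bitVectorCount());
    }
}
```

With two deletes and two flushes, the variant that never resets the pending count ends at 3 recorded deletes against 2 actually deleted documents, while the variant that resets it stays at 2.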
Re: Lucene's default settings & back compatibility
On Thu, May 21, 2009 at 05:19:43PM -0400, Michael McCandless wrote:

> Marvin, which solution would you prefer?

Between the two, I'd prefer settings constructor arguments, though I would be inclined to have settings classes that are specific to individual classes rather than Lucene-wide.

At least that scheme gets locality right. The global actsAsVersion variable violates that principle and has the potential to saddle a small number of users who have done absolutely nothing wrong with bugs that are very, very hard to hunt down. That's unfair.

As far as analyzers and token streams, the theoretical answer is making indexes self-describing via serializable schemas, as discussed on the Lucy dev list, and as implemented in KinoSearch svn trunk. With versioning metadata attached to the index, there is no longer any worry about upgrading analysis modules provided that those modules handle their own versioning correctly.

For instance, in KS the Stopalizer always embeds the complete stoplist in the schema file, so even if we update the "English" stoplist, we don't get invalid search results for indexes which were created with the old stoplist. Similarly, it may not be possible to keep around multiple variants of Snowball, but at least we can fail catastrophically instead of subtly if we detect that the Snowball version has changed.

Full-on schema serialization isn't feasible for Lucene, but attaching an actsAsVersion variable to an index and feeding that to your analyzers would be a decent start.

Lastly, I think a major Java Lucene release is justified already. Won't this discussion die down somewhat if you can get 3.0 out? If there are issues that are half done, how about rolling back whatever's in the way?

Marvin Humphrey
[jira] Commented: (LUCENE-1646) QueryParser throws new exceptions even if custom parsing logic threw a better one
[ https://issues.apache.org/jira/browse/LUCENE-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711850#action_12711850 ]

Trejkaz commented on LUCENE-1646:
---------------------------------

Our improvements are (so far) specific to our subclass of QueryParser, in that we use it when getFieldQuery() gets a value which doesn't make sense for the given field. So in a sense, in our case the query was parsed successfully by the parser, but the input was invalid within one of the fields.

As such our custom ParseException subclass has the field name and field value, but it isn't useful to the Lucene project as-is, as the only things throwing it are called from our subclass. :-(

> QueryParser throws new exceptions even if custom parsing logic threw a better one
> ---------------------------------------------------------------------------------
>
>                 Key: LUCENE-1646
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1646
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.4.1
>            Reporter: Trejkaz
>
> We have subclassed QueryParser and have various custom fields. When these
> fields contain invalid values, we throw a subclass of ParseException which
> has a more useful message (and also a localised message.)
> Problem is, Lucene's QueryParser is doing this:
> {code}
> catch (ParseException tme) {
>   // rethrow to include the original query:
>   throw new ParseException("Cannot parse '" +query+ "': " + tme.getMessage());
> }
> {code}
> Thus, our nice and useful ParseException is thrown away, replaced by one with
> no information about what's actually wrong with the query (it does append
> getMessage() but that isn't localised. And it also throws away the
> underlying cause for the exception.)
> I am about to patch our copy to simply remove these four lines; the caller
> knows what the query string was (they have to have a copy of it because they
> are passing it in!) so having it in the error message itself is not useful.
> Furthermore, when the query string is very big, what the user wants to know
> is not that the whole query was bad, but which part of it was bad.
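To make the complaint concrete, here is a small Java sketch of the rethrow pattern quoted in the issue next to one possible alternative that keeps the original exception as the cause. SketchParseException, FieldValueException, and SketchParser are hypothetical stand-ins, not Lucene's ParseException or QueryParser, and the "field:value" parsing is deliberately trivial. Chaining the cause is not what the reporter proposes (the reporter plans to remove the rethrow from a local copy); it is shown only as another way to avoid losing the custom exception.

```java
/** Stand-in for a parser exception type; NOT Lucene's ParseException. */
class SketchParseException extends Exception {
    SketchParseException(String message) {
        super(message);
    }

    SketchParseException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Custom subclass carrying the field name and value, as the comment describes. */
final class FieldValueException extends SketchParseException {
    final String field;
    final String value;

    FieldValueException(String field, String value) {
        super("Invalid value '" + value + "' for field '" + field + "'");
        this.field = field;
        this.value = value;
    }
}

final class SketchParser {
    /** Custom per-field validation hook: rejects empty values. */
    private static void checkField(String query) throws SketchParseException {
        int colon = query.indexOf(':');
        String field = colon < 0 ? "" : query.substring(0, colon);
        String value = colon < 0 ? query : query.substring(colon + 1);
        if (value.isEmpty()) {
            throw new FieldValueException(field, value);
        }
    }

    /** Mirrors the quoted pattern: the custom exception object is discarded. */
    static void parseLossy(String query) throws SketchParseException {
        try {
            checkField(query);
        } catch (SketchParseException e) {
            throw new SketchParseException("Cannot parse '" + query + "': " + e.getMessage());
        }
    }

    /** Alternative: same message, but the original exception survives as the cause. */
    static void parseKeepingCause(String query) throws SketchParseException {
        try {
            checkField(query);
        } catch (SketchParseException e) {
            throw new SketchParseException("Cannot parse '" + query + "': " + e.getMessage(), e);
        }
    }
}
```

A caller of parseKeepingCause can inspect getCause() and recover the field name and value, which the lossy variant makes impossible.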
RE: Lucene's default settings & back compatibility
On 5/21/2009 at 7:17 AM, Michael McCandless wrote:
> OK so it sounds like we've boiled the proposal down to two concrete
> changes to the back-compat policy:
>
>   1) Default settings can change; we will always choose defaults
>      based on "latest & greatest for new users". This only
>      affects "runtime behavior". EG in 2.9, when sorting by
>      field you won't get scores by default. When we do this we
>      should clearly document the change, and what settings one
>      could use to get back to the old behavior, in CHANGES.txt.
>
>   2) An API, once released as deprecated, is fair game to be
>      removed in the next minor release.
>
> We still only make bug fixes on point releases, support the index
> file format until the next major release -- those don't change.

Contrasting the upgrade experience of existing "maintenance" users (i.e., users not using new Lucene features) under current policy with their experience under the above proposals:

Currently there are two upgrade experiences for these users: a) upgrading within the same major release; and b) major release upgrades. For a), the user reads CHANGES for back-compat exceptions, but otherwise has drop-in compatibility. For b), the user performs two upgrades: first, just like in a), to the last minor release in the same major release, addressing all deprecation warnings; and second, to the major release, with drop-in compatibility, modulo CHANGES.

Here's the upgrade procedure under the above proposals, from version X.Y to X.Z:

1. Address all deprecation warnings against the currently used Lucene version (call it version X.Y[0]).
2. Upgrade to X.(++Y), addressing all deprecation warnings and checking CHANGES for exceptions to the back-compat policy, including mechanisms to maintain X.Y[0] defaults.
3. Iterate #2 until Y==Z.

One consequence of these changes is that major version upgrades become the same as minor version upgrades, with the exception that index format support (and default settings support?) will potentially require attention.

Another consequence is that upgrade effort will no longer be amortizable. Currently, maintenance users can skip minor version upgrades with almost no penalty, and defer the upgrade pain to major release upgrades, since deprecation warnings can be safely ignored. (Not advocating this practice, just noting that it's possible.)

Steve
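The step-by-step procedure above (upgrade one minor release at a time until Y==Z) can be written out as a short Java sketch. UpgradePath and steps() are hypothetical names used only for illustration; the sketch merely enumerates the intermediate releases a maintenance user would have to pass through under the proposal.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that lists the minor releases to step through. */
final class UpgradePath {
    /**
     * Returns every release from X.(Y+1) up to X.Z inclusive, in order.
     * Each entry corresponds to one pass of step 2 in the procedure.
     */
    static List<String> steps(int major, int fromMinor, int toMinor) {
        if (toMinor < fromMinor) {
            throw new IllegalArgumentException("target is older than the current version");
        }
        List<String> steps = new ArrayList<>();
        for (int y = fromMinor + 1; y <= toMinor; y++) {
            steps.add(major + "." + y);
        }
        return steps;
    }
}
```

For example, steps(2, 4, 7) lists 2.5, 2.6 and 2.7: three separate upgrade passes, where under the current policy a single jump with deferred deprecation cleanup would have been possible.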
Re: Lucene's default settings & back compatibility
On Thu, May 21, 2009 at 5:55 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

> On Thu, May 21, 2009 at 5:44 PM, Robert Muir wrote:
> > and what if your analyzer needs a third-party library (or two)?
>
> In such cases the back-compat of your analyzer is your responsibility,
> right?

ICUCollationFilter is a simple example. Just saying: pretend there was a back-compat issue with that, maybe even specific to a certain Locale. That could get nasty quick.

--
Robert Muir
rcm...@gmail.com
Re: Lucene's default settings & back compatibility
On Thu, May 21, 2009 at 5:44 PM, Robert Muir wrote:

> and what if your analyzer needs a third-party library (or two)?

In such cases the back-compat of your analyzer is your responsibility, right?

> i mean this isn't unique to analyzers, if something changes/bug is fixed in
> the guts of some query/scorer that affects scoring in the slightest then
> that's a potential issue too, right?
>
> for a big index burying a result deep is effectively the same as the
> stopword example...

If it's a bug fix, or a change in order-of-operations causing slightly different floating point truncations, we are free to make those fixes (even under the current back-compat policy)? Ie, nothing is changing for those cases.

But, say we found some improvement to how Lucene does scoring, and by and large it improves relevance so we want to do it. New users should see this benefit. Back-compat users, I think, should be able to set actsAsVersion to get back to the old scoring model.

So yeah, I think I agree it's not just changes that affect what gets indexed, but also changes that affect how scores are computed, where we need a way to specify a back-compat version on upgrading.

I think we can't get away with only policy changes here... I think we need actsAsVersion to preserve back-compat.

Mike
Re: Lucene's default settings & back compatibility
and what if your analyzer needs a third-party library (or two)? i mean this isn't unique to analyzers, if something changes/bug is fixed in the guts of some query/scorer that affects scoring in the slightest then thats a potential issue too, right? for a big index burying a result deep is effectively the same as the stopword example... On Thu, May 21, 2009 at 5:27 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Thu, May 21, 2009 at 5:19 PM, Earwin Burrfoot wrote: > >> Why not store an "actsAs" in the index, just for the changes that > >> affect what's in the index? Ie the index records the > >> version that created it, and by default TokenStreams emulate their > >> behavior as of that version? > > > > Because you don't always have access to index at the time you create > > your TokenStreams? > > Such places would have to pass in their own actsAs when they ask the > Analyzer for the tokenStream? > > Ie, the benefit of this approach vs the single global default is it'd > be per-instance. > > Mike > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > > -- Robert Muir rcm...@gmail.com
Re: Lucene's default settings & back compatibility
On Thu, May 21, 2009 at 5:19 PM, Earwin Burrfoot wrote:
>> Why not store an "actsAs" in the index, just for the changes that
>> affect what's in the index? Ie the index records the
>> version that created it, and by default TokenStreams emulate their
>> behavior as of that version?
>
> Because you don't always have access to index at the time you create
> your TokenStreams?

Such places would have to pass in their own actsAs when they ask the
Analyzer for the tokenStream?

Ie, the benefit of this approach vs the single global default is it'd
be per-instance.

Mike
Re: Lucene's default settings & back compatibility
On Thu, May 21, 2009 at 1:59 PM, Marvin Humphrey wrote:
> That bug has led to 'base' having a compromised reputation among elite users
> because of intermittent, inexplicable flakiness. Is that what you want for
> Lucene?

While I agree a single global default is not great, I do think it's
the lesser of all evils here. It really bothers me that our new users
must wait so long (years) to see improvements to our default settings,
because we are so careful about back-compat.

Marvin, which solution would you prefer?

Mike
Re: Lucene's default settings & back compatibility
> Why not store an "actsAs" in the index, just for the changes that
> affect what's in the index? Ie the index records the
> version that created it, and by default TokenStreams emulate their
> behavior as of that version?

Because you don't always have access to index at the time you create
your TokenStreams?

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
[jira] Commented: (LUCENE-1436) Make ReqExclScorer package private, and use DocIdSetIterator for excluded part.
[ https://issues.apache.org/jira/browse/LUCENE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711810#action_12711810 ] Michael McCandless commented on LUCENE-1436: OK why don't we make both package private? Are there other classes BooleanScorer/2 are using that should be made package private? > Make ReqExclScorer package private, and use DocIdSetIterator for excluded > part. > --- > > Key: LUCENE-1436 > URL: https://issues.apache.org/jira/browse/LUCENE-1436 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Paul Elschot >Priority: Trivial > Attachments: LUCENE-1436.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene's default settings & back compatibility
On Thu, May 21, 2009 at 4:34 PM, Shai Erera wrote:

> Changes to the index file formats need to be supported for 2 major releases.
> I.e. 2.X indexes need to be read by 3.Y code, but not by 4.0.

Agreed.

> Method deprecations last for one full minor release.

Your example confused me. I think if in 2.1 we deprecate a method,
then in 2.2 we can remove it? Or are you saying it's not until 2.3
that we can remove it (2 full minor releases)?

> What I don't like about actsAs, and perhaps I just don't understand the
> proposal well, is that I'm not sure where it's added. Will it be added to
> IndexWriter, which will pass it on to all the classes it will meet/use?

We would add say oal.Versions class, that has a static actsAs method
and static constants. If back-compat is vital to your app you'd do:

  Versions.actsAs(Version.LUCENE_24)

on upgrading to 2.9. Else, you do nothing to get "latest & greatest".
You call this once in your app up front, and then use Lucene normally.

Then, when IndexSearcher is asked to do field sorting, it consults
actsAs to decide whether it should do scoring or not.

> Changes to default behaviors, bug fixes or improvements, that may compromise
> the index structure or indexed data (such as InvalidAcronym) will last for at
> least one major release, if not 2

I think this harms new users unnecessarily (ie, I'd rather do actsAs
than this). I'd like StopFilter to not discard positional
information, fixes for bugs in StandardAnalyzer, and a correction to
the default stopwords list, to be immediately available for new users
on the next release.

Why not store an "actsAs" in the index, just for the changes that
affect what's in the index? Ie the index records the version that
created it, and by default TokenStreams emulate their behavior as of
that version?

Mike
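To help visualize the proposal in the message above, here is a minimal sketch of what a single global actsAs switch might look like. It is only an illustration: the Versions class, the nested Version enum, onOrAfter() and scoreWhileSorting() are hypothetical names chosen to match the thread's example, not existing Lucene code, and the "2.9 stops scoring during field sorting" rule is an assumption taken from the one-line description above. The sketch uses an enum for brevity, so it is not written for the 1.4-level Java the project targets.

```java
// Hypothetical sketch of the proposed global "actsAs" switch.
// Nothing here is an actual Lucene class; names follow the thread's example.
final class Versions {

    /** Releases whose default behavior can be emulated, oldest first. */
    enum Version { LUCENE_24, LUCENE_29 }

    // Default: behave like the newest release ("latest & greatest").
    private static volatile Version current = Version.LUCENE_29;

    private Versions() {}

    /** Called once, up front, by apps that need the older defaults. */
    static void actsAs(Version v) {
        current = v;
    }

    /** True if the emulated release is v or newer (enum order is release order). */
    static boolean onOrAfter(Version v) {
        return current.compareTo(v) >= 0;
    }

    /**
     * Example consumer, standing in for the field-sorting case: assume that
     * releases before 2.9 computed scores while sorting and 2.9 does not.
     */
    static boolean scoreWhileSorting() {
        return !onOrAfter(Version.LUCENE_29);
    }
}
```

An app that cares about back-compat would call Versions.actsAs(Versions.Version.LUCENE_24) once at startup; everything else consults the switch. The weakness raised elsewhere in the thread is visible here too: the setting is one static field shared by every caller in the JVM, which is why a per-instance or per-index actsAs keeps coming up as the alternative.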
Re: Lucene's default settings & back compatibility
Sounds like a good proposition. There's one problem I'd like to address. Good names for classes/members matter, and matter much. They directly affect how fast a newcomer is able to understand that particular API, it also affects how comfortable you work with it once you did understand. When we're deprecating existing methods and adding new, 'better' ones, bad or mediocre names replace good names in parts of the code that are most often used. And there's no way around it. It's somewhat crazy, but what if we deprecate stuff and rename it? New stuff gets best names, old stuff is still accessible and with a "Migration Guide" it's easy to patch client code. On Fri, May 22, 2009 at 00:34, Shai Erera wrote: > I thought we were actually on the track towards not introducing any Settings > and/or actAs, but instead just change the policy? > > Can we agree on the following: > > * Changes to the index file formats need to be supported for 2 major > releases. I.e. 2.X indexes need to be read by 3.Y code, but not by 4.0. > > * Method deprecations last for one full minor release. That is a deprecation > in 2.X lasts through 2.X.1, 2.X+1 but is removed in 2.X+2. If all those X's > are confusing --> a deprecation in 2.4 is kept in 2.4.X and 2.5, but we're > free to remove it in 2.6. > > * Changes to default behaviors (whether they are bug fixes or improvements), > where it only affects runtime code, and not the index structure or indexed > data (such as the InvalidAcronym bug fix) are ok to go into any minor > release, w/o deprecation - so long we're documenting the change in CHANGES > along with some sample code on how to migrate easily. > > * Changes to default behaviors, bug fixes or improvements, that may > compromise the index structure or indexed data (such as InvalidAcronym) will > last for at least one major release, if not 2 (just like supporting file > formats). 
The reason is - rebuilding indexes, besides that it might be a > heavy process, is not often acceptable by the customers of those who develop > search solutions. Therefore it may be out of our hands. Personally, I don't > think those will happen a lot, but when they will we can choose between: > (1) Deprecating a class entirely in favor of a new one, such that anyone who > upgrades can still use that class > (2) Introduce a static setter for that behavior, like for InvalidAcronym > (3) Add a actsAs to that class only. > > Am I missing a back-compat issue? > > What I don't like about actsAs, and perhaps I just don't understand the > proposal well, is that I'm not sure where it's added. Will it be added to > IndexWriter, which will pass it on to all the classes it will meet/use? > > If I covered all the back-compat issues above, and we agree on them, then > for the first 3 we just need to document them on the back-compat page, no > code to develop. > > For the last one, if we choose to adopt (1) or (2), then we don't need to > develop any mehcanism up-front, but decide on a per-case basis what's the > best alternative. For example, for the InvalidAcronym we could have > deprecated that particular TokenFilter in favor of a new one and give a code > example on how to create a TokenStream with that deprecated TokenFilter. > > Shai > > On Thu, May 21, 2009 at 10:55 PM, Jason Rutherglen > wrote: >> >> I'm having trouble visualizing the various methods people are talking >> about. It seems like we could open an issue and post patches with code >> illustrating what each person is talking about? 
>> >> On Thu, May 21, 2009 at 10:02 AM, Michael McCandless >> wrote: >>> >>> Actually, we started with the *Settings classes (to hold defaults), >>> but then realized a simple actsAsVersion (single static method) would >>> suffice for just the back-compat settings and then pushed further and >>> thought perhaps we should relax our back-compat policy entirely so >>> emulating older versions is not needed. >>> >>> So we no longer have the "defaults" class (*Settings). We may still >>> do it for the future (for its own benefits), but for just back-compat >>> of default settings, it seems like overkill. >>> >>> But I agree, the index altering cases are spooky. I think this'd make >>> me favor going back to the actsAsVersion option instead of the hard >>> flip on our back compat policy (at least for default settings; for API >>> changes I think 1 whole minor release may be reasonable). >>> >>> Mike >>> >>> On Thu, May 21, 2009 at 12:54 PM, Matthew Hall >>> wrote: >>> > Sorry, I wasn't quite sure what to call this new class you guys have >>> > been >>> > talking about. >>> > >>> > I was referring to the class that's being discussed to encapsulate all >>> > of >>> > the defaults for a given lucene release. (Its caching strategies etc >>> > etc) >>> > >>> > I'm just not certain that something like a static list of words belongs >>> > in a >>> > higher level defaults class like you guys are talking about, especially >>> > considering that anyone using a stop enabled analy
[jira] Created: (LUCENE-1653) Change DateTools to not create a Calendar in every call to dateToString or timeToString
Change DateTools to not create a Calendar in every call to dateToString or timeToString

Key: LUCENE-1653
URL: https://issues.apache.org/jira/browse/LUCENE-1653
Project: Lucene - Java
Issue Type: Improvement
Components: Other
Reporter: Shai Erera
Priority: Minor
Fix For: 2.9

DateTools creates a Calendar instance on every call to dateToString and timeToString. Specifically:

# timeToString calls Calendar.getInstance on every call.
# dateToString calls timeToString(date.getTime()), which then instantiates a new Date(). I think we should change the order of the calls, or not have each call the other.
# round(), which is called from timeToString (after creating a Calendar instance) creates another (!) Calendar instance ...

Seems that if we synchronize the methods and create the Calendar instance once (static), it should solve it.
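The fix suggested in the issue above (one static Calendar, synchronized access) could look roughly like the following. This is a sketch under assumptions, not the DateTools patch: SharedCalendarDates and dayToString are hypothetical names, and only a day-resolution yyyyMMdd format in GMT is shown.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper illustrating the idea; not the actual DateTools code.
final class SharedCalendarDates {

    // One Calendar for the whole class. Calendar is not thread-safe,
    // so every access goes through a synchronized method.
    private static final Calendar CAL =
        new GregorianCalendar(TimeZone.getTimeZone("GMT"), Locale.ROOT);

    private SharedCalendarDates() {}

    /** Formats a timestamp as yyyyMMdd in GMT, reusing the shared Calendar. */
    static synchronized String dayToString(long timeMillis) {
        CAL.setTimeInMillis(timeMillis);
        return String.format(Locale.ROOT, "%04d%02d%02d",
            CAL.get(Calendar.YEAR),
            CAL.get(Calendar.MONTH) + 1,   // Calendar months are 0-based
            CAL.get(Calendar.DAY_OF_MONTH));
    }
}
```

The trade-off is the one implied by the issue text: allocation per call is replaced by lock contention on a shared instance, so whether this is a win depends on how many threads format dates concurrently.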
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711791#action_12711791 ] Shai Erera commented on LUCENE-1614: {quote} Are both new in 2.9? Yes. {quote} Oh that's great - and here I was deprecating next(int) in favor of the new advance. I'll just delete it then. bq. I'm going to open the follow-up issue to that Opened LUCENE-1652 and copied what's relevant from this issue to there. If I missed something, please add it. > Add next() and skipTo() variants to DocIdSetIterator that return the current > doc, instead of boolean > > > Key: LUCENE-1614 > URL: https://issues.apache.org/jira/browse/LUCENE-1614 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera > Fix For: 2.9 > > Attachments: LUCENE-1614.patch, LUCENE-1614.patch > > > See > http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html > for the full discussion. The basic idea is to add variants to those two > methods that return the current doc they are at, to save successive calls to > doc(). If there are no more docs, return -1. A summary of what was discussed > so far: > # Deprecate those two methods. > # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI > (calls next() and skipTo() respectively, and will be changed to abstract in > 3.0). > #* I actually would like to propose an alternative to the names: advance() > and advance(int) - the first advances by one, the second advances to target. > # Wherever these are used, do something like '(doc = advance()) >= 0' instead > of comparing to -1 for improved performance. > I will post a patch shortly -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
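The "default impl in DISI" step from the quoted summary can be sketched as follows. The class names DocIterator and OldStyleIterator are hypothetical stand-ins, and the -1 return value follows the convention stated in the issue description quoted above rather than any later revision of it.

```java
// Hypothetical base class; it mirrors the shape described in the issue summary.
abstract class DocIterator {

    /** @deprecated old style: advance, then report whether a doc exists. */
    @Deprecated
    abstract boolean next();

    /** The doc the iterator is currently positioned on. */
    abstract int doc();

    /**
     * New style: advance and return the current doc, or -1 when exhausted.
     * The default implementation delegates to the old method, so subclasses
     * written against the old API keep compiling and working.
     */
    int nextDoc() {
        return next() ? doc() : -1;
    }
}

/** A subclass written against the old API only; it never overrides nextDoc(). */
final class OldStyleIterator extends DocIterator {
    private final int[] docs;
    private int pos = -1;

    OldStyleIterator(int[] sortedDocs) {
        this.docs = sortedDocs;
    }

    @Override
    boolean next() {
        return ++pos < docs.length;
    }

    @Override
    int doc() {
        return docs[pos];
    }
}
```

This is why the thread prefers abstract classes over interfaces for back-compat: a concrete default can be added without breaking existing subclasses, and only the later switch to abstract forces them to change.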
[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker
[ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711792#action_12711792 ]

Shai Erera commented on LUCENE-1595:

Ok I'll make sure it's 1.4 compatible then.

> Split DocMaker into ContentSource and DocMaker
>
> Key: LUCENE-1595
> URL: https://issues.apache.org/jira/browse/LUCENE-1595
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/benchmark
> Reporter: Shai Erera
> Assignee: Mark Miller
> Fix For: 2.9
>
> This issue proposes some refactoring to the benchmark package. Today,
> DocMaker has two roles: collecting documents from a collection and preparing
> a Document object. These two should actually be split up to ContentSource and
> DocMaker, which will use a ContentSource instance.
> ContentSource will implement all the methods of DocMaker, like
> getNextDocData, raw size in bytes tracking etc. This can actually fit well w/
> 1591, by having a basic ContentSource that offers input stream services, and
> wraps a file (for example) with a bzip or gzip streams etc.
> DocMaker will implement the makeDocument methods, reusing DocState etc.
> The idea is that collecting the Enwiki documents, for example, should be the
> same whether I create documents using DocState, add payloads or index
> additional metadata. Same goes for Trec and Reuters collections, as well as
> LineDocMaker.
> In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are
> 99% the same and 99% different. Most of their differences lie in the way they
> read the data, while most of the similarity lies in the way they create
> documents (using DocState).
> That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker
> (just the reuse of DocState). Also, other DocMakers do not use that DocState
> today, something they could have gotten for free with this refactoring
> proposed.
> So by having an EnwikiContentSource, ReutersContentSource and others (TREC,
> Line, Simple), I can write several DocMakers, such as DocStateMaker,
> ConfigurableDocMaker (one which accepts all kinds of config options) and
> custom DocMakers (payload, facets, sorting), passing to them a ContentSource
> instance and reuse the same DocMaking algorithm with many content sources, as
> well as the same ContentSource algorithm with many DocMaker implementations.
> This will also give us the opportunity to perf test content sources alone
> (i.e., compare bzip, gzip and regular input streams), w/o the overhead of
> creating a Document object.
> I've already done so in my code environment (I extend the benchmark package
> for my application's purposes) and I like the flexibility I have. I think
> this can be a nice contribution to the benchmark package, which can result in
> some code cleanup as well.
[jira] Commented: (LUCENE-1436) Make ReqExclScorer package private, and use DocIdSetIterator for excluded part.
[ https://issues.apache.org/jira/browse/LUCENE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711793#action_12711793 ]

Paul Elschot commented on LUCENE-1436:

The reason to make things package private is to have more freedom to change things later on that are not really part of the public api now.
[jira] Created: (LUCENE-1652) Enhancements to Scorers following the changes to DocIdSetIterator
Enhancements to Scorers following the changes to DocIdSetIterator

Key: LUCENE-1652
URL: https://issues.apache.org/jira/browse/LUCENE-1652
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Fix For: 3.0

In LUCENE-1614, we changed the semantics of DocIdSetIterator's methods to return a sentinel NO_MORE_DOCS (= Integer.MAX_VALUE) when the iterator is exhausted. Due to backward compatibility issues, we couldn't implement those semantics in doc(). Therefore this issue, which can be introduced in 3.0 only, will:

# Implement the new semantics in all extending classes, such that doc() will return NO_MORE_DOCS when the iterator is exhausted.
# Change BooleanScorer to take advantage of that by removing sub.done from SubScorer and operate under the assumption that NO_MORE_DOCS is larger than any doc ID (Integer.MAX_VALUE).
# Change ConjunctionScorer to operate under the same assumptions and remove 'more'.
# Change ReqExclScorer to not rely on reqScorer in doc(), since the latter may be null.
# Make more changes to ConjunctionScorer's init() and remove 'firstTime' to improve the performance of nextDoc(), score(), advance().
# Add start()/finish() to DISI?

A snippet from LUCENE-1614 regarding the change in BooleanScorer:

{code}
int doc = sub.done ? -1 : scorer.doc();
while (!sub.done && doc < end) {
  sub.collector.collect(doc);
  doc = scorer.nextDoc();
  sub.done = doc < 0;
}
{code}

To this:

{code}
int doc = scorer.doc();
while (doc < end) {
  sub.collector.collect(doc);
  doc = scorer.nextDoc();
}
{code}

And in ConjunctionScorer, change this:

{code}
while (more && (firstScorer=scorers[first]).doc() < (lastDoc=lastScorer.doc())) {
  more = firstScorer.advance(lastDoc) >= 0;
  lastScorer = firstScorer;
  first = (first == (scorers.length-1)) ? 0 : first+1;
}
return more;
{code}

To this:

{code}
while ((firstScorer=scorers[first]).doc() < (lastDoc=lastScorer.doc())) {
  firstScorer.advance(lastDoc);
  lastScorer = firstScorer;
  first = (first == (scorers.length-1)) ? 0 : first+1;
}
return lastDoc != DOC_SENTINEL;
{code}
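The simplification in the BooleanScorer snippet relies on the sentinel being larger than any real doc ID, so a single "doc < end" comparison covers both "past the window" and "exhausted". A self-contained sketch of that idea follows; ArrayDocIterator and collectBelow are hypothetical stand-ins written for this illustration, not the Lucene scorer classes.

```java
import java.util.ArrayList;
import java.util.List;

final class SentinelIteration {

    // Sentinel as described in the issue: larger than any valid doc ID.
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private SentinelIteration() {}

    /** Hypothetical minimal iterator over a sorted array of doc IDs. */
    static final class ArrayDocIterator {
        private final int[] docs;
        private int pos = -1;

        ArrayDocIterator(int[] sortedDocs) {
            this.docs = sortedDocs;
        }

        /** Advances by one and returns the current doc, or NO_MORE_DOCS. */
        int nextDoc() {
            if (pos < docs.length) {
                pos++;
            }
            return doc();
        }

        /** -1 before the first advance, NO_MORE_DOCS once exhausted. */
        int doc() {
            if (pos < 0) {
                return -1;
            }
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
    }

    /** Collects every doc below end; no separate "done" flag is needed. */
    static List<Integer> collectBelow(ArrayDocIterator it, int end) {
        List<Integer> hits = new ArrayList<>();
        int doc = it.nextDoc();
        while (doc < end) {   // also false once doc == NO_MORE_DOCS
            hits.add(doc);
            doc = it.nextDoc();
        }
        return hits;
    }
}
```

With a -1 sentinel the loop needs the extra done/more flag shown in the "before" snippets, because -1 compares as smaller than end; moving the sentinel to Integer.MAX_VALUE is what lets the flag disappear.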
[jira] Commented: (LUCENE-1436) Make ReqExclScorer package private, and use DocIdSetIterator for excluded part.
[ https://issues.apache.org/jira/browse/LUCENE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711790#action_12711790 ]

Paul Elschot commented on LUCENE-1436:

This should only affect external code that uses the ReqExclScorer class. From the amount of reactions on that so far I think such code probably does not exist.

The move to DocIdSetIterator (superclass of Scorer) can be done because no score values are used on the excluded part. This is only a consequence of the extraction of the DocIdSetIterator superclass from Scorer. So I don't expect LUCENE-1614 or new semantics of DISI to collide with this. Also, I don't expect LUCENE-1614 to collide with LUCENE-1345 that allows filter as clause to boolean query.

Some time ago DisjunctionSumScorer was made private in a similar way. ReqOptSumScorer could also be made package private, perhaps even at the same time as ReqExclScorer.
Re: Lucene's default settings & back compatibility
I thought we were actually on the track towards not introducing any Settings and/or actAs, but instead just change the policy? Can we agree on the following: * Changes to the index file formats need to be supported for 2 major releases. I.e. 2.X indexes need to be read by 3.Y code, but not by 4.0. * Method deprecations last for one full minor release. That is a deprecation in 2.X lasts through 2.X.1, 2.X+1 but is removed in 2.X+2. If all those X's are confusing --> a deprecation in 2.4 is kept in 2.4.X and 2.5, but we're free to remove it in 2.6. * Changes to default behaviors (whether they are bug fixes or improvements), where it only affects runtime code, and not the index structure or indexed data (such as the InvalidAcronym bug fix) are ok to go into any minor release, w/o deprecation - so long we're documenting the change in CHANGES along with some sample code on how to migrate easily. * Changes to default behaviors, bug fixes or improvements, that may compromise the index structure or indexed data (such as InvalidAcronym) will last for at least one major release, if not 2 (just like supporting file formats). The reason is - rebuilding indexes, besides that it might be a heavy process, is not often acceptable by the customers of those who develop search solutions. Therefore it may be out of our hands. Personally, I don't think those will happen a lot, but when they will we can choose between: (1) Deprecating a class entirely in favor of a new one, such that anyone who upgrades can still use that class (2) Introduce a static setter for that behavior, like for InvalidAcronym (3) Add a actsAs to that class only. Am I missing a back-compat issue? What I don't like about actsAs, and perhaps I just don't understand the proposal well, is that I'm not sure where it's added. Will it be added to IndexWriter, which will pass it on to all the classes it will meet/use? 
If I covered all the back-compat issues above, and we agree on them, then for the first 3 we just need to document them on the back-compat page, no code to develop. For the last one, if we choose to adopt (1) or (2), then we don't need to develop any mechanism up-front, but decide on a per-case basis what's the best alternative. For example, for the InvalidAcronym we could have deprecated that particular TokenFilter in favor of a new one and give a code example on how to create a TokenStream with that deprecated TokenFilter. Shai On Thu, May 21, 2009 at 10:55 PM, Jason Rutherglen < jason.rutherg...@gmail.com> wrote: > I'm having trouble visualizing the various methods people are talking > about. It seems like we could open an issue and post patches with code > illustrating what each person is talking about? > > On Thu, May 21, 2009 at 10:02 AM, Michael McCandless < > luc...@mikemccandless.com> wrote: > >> Actually, we started with the *Settings classes (to hold defaults), >> but then realized a simple actsAsVersion (single static method) would >> suffice for just the back-compat settings and then pushed further and >> thought perhaps we should relax our back-compat policy entirely so >> emulating older versions is not needed. >> >> So we no longer have the "defaults" class (*Settings). We may still >> do it for the future (for its own benefits), but for just back-compat >> of default settings, it seems like overkill. >> >> But I agree, the index altering cases are spooky. I think this'd make >> me favor going back to the actsAsVersion option instead of the hard >> flip on our back compat policy (at least for default settings; for API >> changes I think 1 whole minor release may be reasonable). >> >> Mike >> >> On Thu, May 21, 2009 at 12:54 PM, Matthew Hall >> wrote: >> > Sorry, I wasn't quite sure what to call this new class you guys have >> been >> > talking about. 
>> > >> > I was referring to the class that's being discussed to encapsulate all >> of >> > the defaults for a given lucene release. (Its caching strategies etc >> etc) >> > >> > I'm just not certain that something like a static list of words belongs >> in a >> > higher level defaults class like you guys are talking about, especially >> > considering that anyone using a stop enabled analyzer really should be >> > familiar with this list, and oftentimes needs to override it. >> > >> > Meh, now that I'm actually typing it out though, perhaps I'm incorrect >> here, >> > assuming this class you guys are describing will be well >> > advertised/documented maybe it will actually make it easier for end >> > developers to twiddle around with this list, or at least certainly make >> them >> > more aware that its even something that they have the ability to >> actually >> > change. >> > >> > Matt >> > >> > Michael McCandless wrote: >> >> >> >> What is the "lucene defaults class"? >> >> >> >> Mike >> >> >> >> On Thu, May 21, 2009 at 12:37 PM, Matthew Hall >> >> wrote: >> >> >> >>> >> >>> For extreme examples like this, couldn't the stopword list be >> >>> encapsulated >> >>> into a
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711785#action_12711785 ] Michael McCandless commented on LUCENE-1614: bq. Are you sure about it? Yes. bq. If so, then why test-tag failed on it? Unfortunately, we came up with the idea of the back-compat branch after 2.4 was released, so we cut the branch at that point (in 2.9), so the back-compat branch does contain tests for early 2.9-only features. bq. Are both new in 2.9? Yes. {quote} BTW, I'm going to open the follow-up issue to that, so we can discuss whatever improvements we want to make to the Scorers following the MAX_VAL sentinel there. Otherwise, they will get lost in this issue, and when we'll handle the follow-up one, we might not remember everything. {quote} Agreed! > Add next() and skipTo() variants to DocIdSetIterator that return the current > doc, instead of boolean > > > Key: LUCENE-1614 > URL: https://issues.apache.org/jira/browse/LUCENE-1614 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera > Fix For: 2.9 > > Attachments: LUCENE-1614.patch, LUCENE-1614.patch > > > See > http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html > for the full discussion. The basic idea is to add variants to those two > methods that return the current doc they are at, to save successive calls to > doc(). If there are no more docs, return -1. A summary of what was discussed > so far: > # Deprecate those two methods. > # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI > (calls next() and skipTo() respectively, and will be changed to abstract in > 3.0). > #* I actually would like to propose an alternative to the names: advance() > and advance(int) - the first advances by one, the second advances to target. 
> # Wherever these are used, do something like '(doc = advance()) >= 0' instead > of comparing to -1 for improved performance. > I will post a patch shortly -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1436) Make ReqExclScorer package private, and use DocIdSetIterator for excluded part.
[ https://issues.apache.org/jira/browse/LUCENE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711783#action_12711783 ] Shai Erera commented on LUCENE-1436: I just hope this does not collide with LUCENE-1614. Note that in the follow-up issue to 1614 (the number is yet to be generated) I intend to change the logic in ReqExclScorer so that it conforms to the new semantics of DISI. Why do we need to make this class package-private? Any particular reason besides not needing to worry about changes to back-compat in the future? > Make ReqExclScorer package private, and use DocIdSetIterator for excluded > part. > --- > > Key: LUCENE-1436 > URL: https://issues.apache.org/jira/browse/LUCENE-1436 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Paul Elschot >Priority: Trivial > Attachments: LUCENE-1436.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711782#action_12711782 ] Shai Erera commented on LUCENE-1614: bq. Oh, it turns out OBSI.nextDoc is new in 2.9! Are you sure about it? If so, then why test-tag failed on it? Notice that there are two methods nextDoc() and next(int). Are both new in 2.9? If so, it means somebody added them to the tag, for some reason ... BTW, I'm going to open the follow-up issue to that, so we can discuss whatever improvements we want to make to the Scorers following the MAX_VAL sentinel there. Otherwise, they will get lost in this issue, and when we'll handle the follow-up one, we might not remember everything. bq. It's called - @Since That's absolutely a great idea ! > Add next() and skipTo() variants to DocIdSetIterator that return the current > doc, instead of boolean > > > Key: LUCENE-1614 > URL: https://issues.apache.org/jira/browse/LUCENE-1614 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera > Fix For: 2.9 > > Attachments: LUCENE-1614.patch, LUCENE-1614.patch > > > See > http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html > for the full discussion. The basic idea is to add variants to those two > methods that return the current doc they are at, to save successive calls to > doc(). If there are no more docs, return -1. A summary of what was discussed > so far: > # Deprecate those two methods. > # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI > (calls next() and skipTo() respectively, and will be changed to abstract in > 3.0). > #* I actually would like to propose an alternative to the names: advance() > and advance(int) - the first advances by one, the second advances to target. > # Wherever these are used, do something like '(doc = advance()) >= 0' instead > of comparing to -1 for improved performance. 
> I will post a patch shortly -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
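The proposal in the issue description can be sketched roughly as follows. This is a minimal, self-contained illustration, not the attached patch: `SketchDocIdSetIterator`, `ArrayDocIdSetIterator` and `DisiDemo` are hypothetical names, the old-style methods are only described as deprecated in comments, and the -1 "no more docs" value is taken from the description above (the comment also mentions a MAX_VAL sentinel as a follow-up idea, which is not shown here).

```java
/** Hypothetical stand-in for DocIdSetIterator; not the real Lucene class. */
abstract class SketchDocIdSetIterator {
    /** Old style (to be deprecated per the proposal): move to the next doc. */
    abstract boolean next();

    /** Old style (to be deprecated per the proposal): move to the first doc >= target. */
    abstract boolean skipTo(int target);

    /** Current doc; only meaningful after next()/skipTo() returned true. */
    abstract int doc();

    /** New style: returns the doc moved to, or -1 when there are no more docs. */
    int nextDoc() {
        return next() ? doc() : -1;
    }

    /** New style: returns the first doc >= target, or -1 when there is none. */
    int advance(int target) {
        return skipTo(target) ? doc() : -1;
    }
}

/** Walks a sorted int array and implements only the old-style methods. */
final class ArrayDocIdSetIterator extends SketchDocIdSetIterator {
    private final int[] docs;
    private int pos = -1;

    ArrayDocIdSetIterator(int... sortedDocs) {
        this.docs = sortedDocs;
    }

    @Override
    boolean next() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length;
    }

    @Override
    boolean skipTo(int target) {
        // Always moves at least one position, then keeps going until doc >= target.
        do {
            if (!next()) {
                return false;
            }
        } while (docs[pos] < target);
        return true;
    }

    @Override
    int doc() {
        return docs[pos];
    }
}

final class DisiDemo {
    /** Caller loop in the style suggested in the issue: one call per step, test with >= 0. */
    static int sum(SketchDocIdSetIterator it) {
        int doc;
        int total = 0;
        while ((doc = it.nextDoc()) >= 0) {
            total += doc;
        }
        return total;
    }
}
```

Because the new methods have default implementations that delegate to the old ones, an existing subclass such as `ArrayDocIdSetIterator` keeps compiling unchanged, which is the back-compat property the description relies on.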
Re: Lucene's default settings & back compatibility
I'm having trouble visualizing the various methods people are talking about. It seems like we could open an issue and post patches with code illustrating what each person is talking about? On Thu, May 21, 2009 at 10:02 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > Actually, we started with the *Settings classes (to hold defaults), > but then realized a simple actsAsVersion (single static method) would > suffice for just the back-compat settings and then pushed further and > thought perhaps we should relax our back-compat policy entirely so > emulating older versions is not needed. > > So we no longer have the "defaults" class (*Settings). We may still > do it for the future (for its own benefits), but for just back-compat > of default settings, it seems like overkill. > > But I agree, the index altering cases are spooky. I think this'd make > me favor going back to the actsAsVersion option instead of the hard > flip on our back compat policy (at least for default settings; for API > changes I think 1 whole minor release may be reasonable). > > Mike > > On Thu, May 21, 2009 at 12:54 PM, Matthew Hall > wrote: > > Sorry, I wasn't quite sure what to call this new class you guys have been > > talking about. > > > > I was referring to the class that's being discussed to encapsulate all of > > the defaults for a given lucene release. (Its caching strategies etc > etc) > > > > I'm just not certain that something like a static list of words belongs > in a > > higher level defaults class like you guys are talking about, especially > > considering that anyone using a stop enabled analyzer really should be > > familiar with this list, and oftentimes needs to override it. 
> > > > Meh, now that I'm actually typing it out though, perhaps I'm incorrect > here, > > assuming this class you guys are describing will be well > > advertised/documented maybe it will actually make it easier for end > > developers to twiddle around with this list, or at least certainly make > them > > more aware that its even something that they have the ability to actually > > change. > > > > Matt > > > > Michael McCandless wrote: > >> > >> What is the "lucene defaults class"? > >> > >> Mike > >> > >> On Thu, May 21, 2009 at 12:37 PM, Matthew Hall > >> wrote: > >> > >>> > >>> For extreme examples like this, couldn't the stopword list be > >>> encapsulated > >>> into a single class that's used by the lucene defaults class. > >>> > >>> That way if you folks released updates to mostly static content like a > >>> stopword list, new or old users could get it easily with a simple drop > in > >>> fix? > >>> > >>> Just my two cents. > >>> > >>> Matt > >>> > >>> Michael McCandless wrote: > >>> > > On Thu, May 21, 2009 at 12:19 PM, Robert Muir > wrote: > > > > > > even as simple as changing default stopword list for some analyzer > > could > > be > > an issue, if the user doesn't re-index in response to that change. > > > > > > OK, right. > > So say we forgot to include "the" in the default English stopwords > list (yes, an extreme example...). > > Under the proposed changes 1 & 2 to back-compat policy, we would add > "the" to the default stopword list, so new users get the fix, but > still keep the the-less list accessible (deprecated). We'd add an > entry in CHANGES.txt saying this happened, and then show code on how > to get back to the the-less stopword list. > > New users using that StopFilter would properly see "the" filtered out. > Users who upgraded would need to fix their code to switch back to the > deprecated the-less list. 
> > Mike
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711773#action_12711773 ]

Earwin Burrfoot commented on LUCENE-1614:

bq. Oh, it turns out OBSI.nextDoc is new in 2.9!

The phrase sounds all too familiar :) There's one absolutely cool javadoc tag, which I suggest we start using for all user-visible classes and their members. It's called @since. Suddenly, everything that's not yet released (and that's a big bunch) is clearly marked as free for changes and amendments.
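As a small illustration of the suggestion, here is how the tag might look on a class and its members. Javadoc spells the tag in lowercase, `@since`. The class `SinceTaggedIterator` and the version numbers on it are hypothetical and only show the convention.

```java
/**
 * Hypothetical iterator used only to illustrate the tag.
 *
 * @since 2.4
 */
class SinceTaggedIterator {
    private final int maxDoc;
    private int doc = -1;

    SinceTaggedIterator(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    /**
     * Moves to the next doc.
     *
     * @return true if there is a next doc
     * @since 2.4
     */
    boolean next() {
        if (doc < maxDoc) {
            doc++;
        }
        return doc < maxDoc;
    }

    /**
     * Moves to the next doc and returns it, or -1 if there is none.
     *
     * @since 2.9 (not yet released in this illustration, so still open to change)
     */
    int nextDoc() {
        return next() ? doc : -1;
    }
}
```

A reader checking whether a member may still change only has to compare its `@since` value with the latest released version.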
Re: Lucene's default settings & back compatibility
> That bug has led to 'base' having a compromised reputation among elite users because of intermittent, inexplicable flakiness. Is that what you want for Lucene?

While I agree with that point, Lucene already has lots and lots of static configuration. Having actsAsVersion won't add any new woes. Well, it won't remove them either. I dislike both the single static variable approach and a Settings object you have to pass into each and every constructor. If I absolutely must choose, and have no right to just throw all these back-compat crutches out of the window, I'd choose a single static variable.

As an aside, it's amusing to see how the discussion is calming down and will seemingly end in minor amendments to the policy, if any at all :)

-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
[jira] Commented: (LUCENE-1436) Make ReqExclScorer package private, and use DocIdSetIterator for excluded part.
[ https://issues.apache.org/jira/browse/LUCENE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711753#action_12711753 ]

Michael McCandless commented on LUCENE-1436:

Paul, this is technically a change in back-compat (since it's a public class that this patch makes package private). Are you thinking it's just very unlikely that this is used by anything except BooleanScorer2? We also have at least ReqOptSumScorer (and maybe others?), which is public.

> Make ReqExclScorer package private, and use DocIdSetIterator for excluded part.
>
> Key: LUCENE-1436
> URL: https://issues.apache.org/jira/browse/LUCENE-1436
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Paul Elschot
> Priority: Trivial
> Attachments: LUCENE-1436.patch
Re: Lucene's default settings & back compatibility
Mike McCandless:
> Well this is what I love about the actsAsVersion solution. There's no pain for our back-compat users (besides the one-time effort to set actsAsVersion), and new users always get the best settings.

When some mad-as-hell user complains to this list after spending an inordinate amount of time chasing down an action-at-a-distance bug because of this insidious and irresponsible OO design decision, I intend to follow up their email with an I-told-you-so.

There's an action-at-a-distance bug in the Perl core module 'base.pm' that bedeviled people for years before I finally cornered it. Turns out it can't be fixed, but at least now we know what's happening:

http://rt.cpan.org/Public/Bug/Display.html?id=28799

While this error does not occur frequently in the wild, when it does, the cost to the user is high because the debug path is obscure. I personally encountered it after failing to wrap a "use_ok" test in a BEGIN block; isolating it took me... longer than I would have liked. ;)

That bug has led to 'base' having a compromised reputation among elite users because of intermittent, inexplicable flakiness. Is that what you want for Lucene?

Marvin Humphrey
[jira] Resolved: (LUCENE-1636) TokenFilters with a null value in the constructor fail
[ https://issues.apache.org/jira/browse/LUCENE-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1636.

Resolution: Fixed
Fix Version/s: 2.9

> TokenFilters with a null value in the constructor fail
>
> Key: LUCENE-1636
> URL: https://issues.apache.org/jira/browse/LUCENE-1636
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 2.9
> Reporter: Wouter Heijke
> Assignee: Michael McCandless
> Fix For: 2.9
> Attachments: LUCENE-1636.patch
>
> While migrating from 2.4.x to 2.9-dev I found a lot of failing unit tests. One problem is with TokenFilters that do a super(null) in the constructor. I fixed it by changing the constructor to super(new EmptyTokenStream()). This will cause problems and frustration to others while migrating to 2.9.
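The workaround described in the issue (pass an empty stream instead of null) can be illustrated with a small self-contained sketch. None of these classes are Lucene's: `SketchTokenStream`, `SketchEmptyTokenStream`, `SketchTokenFilter` and `SketchLowerCaseFilter` are hypothetical stand-ins, and the fail-fast null check with a final delegate is only one possible reading of the issue, not a description of the committed patch.

```java
import java.util.Locale;
import java.util.Objects;

/** Hypothetical minimal token stream; stands in for Lucene's TokenStream. */
abstract class SketchTokenStream {
    /** Returns the next token, or null when the stream is exhausted. */
    abstract String next();
}

/** A stream with no tokens, usable as a placeholder delegate instead of null. */
final class SketchEmptyTokenStream extends SketchTokenStream {
    @Override
    String next() {
        return null;
    }
}

/** Hypothetical filter base class whose delegate is fixed at construction time. */
abstract class SketchTokenFilter extends SketchTokenStream {
    protected final SketchTokenStream input;

    protected SketchTokenFilter(SketchTokenStream input) {
        // Fail early with a clear message rather than later with an obscure NPE.
        this.input = Objects.requireNonNull(input, "input must not be null");
    }
}

/** Example filter: lower-cases every token of its delegate. */
final class SketchLowerCaseFilter extends SketchTokenFilter {
    SketchLowerCaseFilter(SketchTokenStream input) {
        super(input);
    }

    @Override
    String next() {
        String token = input.next();
        return token == null ? null : token.toLowerCase(Locale.ROOT);
    }
}
```

With this shape, `new SketchLowerCaseFilter(new SketchEmptyTokenStream())` is a valid filter that simply produces no tokens, while `new SketchLowerCaseFilter(null)` is rejected immediately.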
[jira] Commented: (LUCENE-1636) TokenFilters with a null value in the constructor fail
[ https://issues.apache.org/jira/browse/LUCENE-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711704#action_12711704 ]

Michael McCandless commented on LUCENE-1636:

I think we should change this in 2.9, for the reasons Uwe pointed out, to disallow changing the delegate after construction.
[jira] Assigned: (LUCENE-1636) TokenFilters with a null value in the constructor fail
[ https://issues.apache.org/jira/browse/LUCENE-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-1636:

Assignee: Michael McCandless
[jira] Commented: (LUCENE-1637) Getting an IndexReader from a committed IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711701#action_12711701 ] Michael McCandless commented on LUCENE-1637: Couldn't you simply call IW.getReader() just after calling IW.commit()? Or just IndexReader.reopen? (The extra work that reopen does vs IndexWriter.getReader is fairly minor -- the former must find & open the segments_N file; after that it's the same). > Getting an IndexReader from a committed IndexWriter > --- > > Key: LUCENE-1637 > URL: https://issues.apache.org/jira/browse/LUCENE-1637 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.9 >Reporter: Shay Banon > > I just had a look at the job done in IndexWriter in order to get an > IndexReader with all the current ongoing changes done using the IndexWriter. > This feature is very useful, and I was wondering if another feature, which (I > think) is simple to implement (compared to the previous one) might make > sense. >Many times, an application opens an IndexWriter, does whatever changes it > does, and then commits the changes. It would be nice to get an IndexReader > (read only one is fine) that corresponds to the committed (or even closed) > IndexWriter. This will allow for a cache of IndexReader that is already used > to be updated with a fresh IndexReader, without the need to reopen one (which > should be slower than opening one based on the IndexWriter information). The > main difference is the fact that the mentioned IndexReader could still be > reopened without the need to throw an AlreadyClosedException. >More information can be found here: > http://www.nabble.com/Getting-an-IndexReader-from-a-committed-IndexWriter-td23551978.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711695#action_12711695 ]

Michael McCandless commented on LUCENE-1614:

Oh, it turns out OBSI.nextDoc is new in 2.9! So we are free to change it...
Re: Lucene's default settings & back compatibility
Michael McCandless wrote:
> On Thu, May 21, 2009 at 12:19 PM, Robert Muir wrote:
> > even as simple as changing default stopword list for some analyzer could be an issue, if the user doesn't re-index in response to that change.
>
> OK, right.
>
> So say we forgot to include "the" in the default English stopwords list (yes, an extreme example...).

"The" would be a bug fix. I think most users would expect that to be fixed. They might be willing, as I would be, to require all their indexes using that stopword list to be rebuilt.

How about a change that would be a bit more controversial, to which some would agree and others would not?

I wonder how many people are creating metadata about indexes so that they can track when an index could/should/must be rebuilt? Some kind of "versioned tool chain info" for the index. If analyzers and filters can change output then it needs to be tracked.

-- DM
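One way an application could keep such "versioned tool chain info" is a small properties record stored next to the index and compared at startup. The sketch below is purely illustrative: `IndexToolchainInfo`, the property keys and the revision strings are all made-up names, and the choice to treat only the analyzer chain and stopword revision as rebuild triggers is an assumption.

```java
import java.util.Objects;
import java.util.Properties;

/** Hypothetical application-side record of what produced an index. */
final class IndexToolchainInfo {
    static final String LUCENE_VERSION = "lucene.version";
    static final String ANALYZER_CHAIN = "analyzer.chain";
    static final String STOPWORDS_REVISION = "stopwords.revision";

    /** Builds the record; callers could persist it with Properties.store(...). */
    static Properties describe(String luceneVersion, String analyzerChain, String stopwordsRevision) {
        Properties p = new Properties();
        p.setProperty(LUCENE_VERSION, luceneVersion);
        p.setProperty(ANALYZER_CHAIN, analyzerChain);
        p.setProperty(STOPWORDS_REVISION, stopwordsRevision);
        return p;
    }

    /**
     * Assumption for this sketch: only entries that can change the token
     * output force a rebuild; the library version alone is informational.
     */
    static boolean needsRebuild(Properties recorded, Properties current) {
        for (String key : new String[] {ANALYZER_CHAIN, STOPWORDS_REVISION}) {
            if (!Objects.equals(recorded.getProperty(key), current.getProperty(key))) {
                return true;
            }
        }
        return false;
    }
}
```

An application would write the record when it builds the index and call `needsRebuild` with a freshly computed record whenever the jar or its analyzer configuration changes.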
Re: Lucene's default settings & back compatibility
Actually, we started with the *Settings classes (to hold defaults), but then realized a simple actsAsVersion (single static method) would suffice for just the back-compat settings and then pushed further and thought perhaps we should relax our back-compat policy entirely so emulating older versions is not needed. So we no longer have the "defaults" class (*Settings). We may still do it for the future (for its own benefits), but for just back-compat of default settings, it seems like overkill. But I agree, the index altering cases are spooky. I think this'd make me favor going back to the actsAsVersion option instead of the hard flip on our back compat policy (at least for default settings; for API changes I think 1 whole minor release may be reasonable). Mike On Thu, May 21, 2009 at 12:54 PM, Matthew Hall wrote: > Sorry, I wasn't quite sure what to call this new class you guys have been > talking about. > > I was referring to the class that's being discussed to encapsulate all of > the defaults for a given lucene release. (Its caching strategies etc etc) > > I'm just not certain that something like a static list of words belongs in a > higher level defaults class like you guys are talking about, especially > considering that anyone using a stop enabled analyzer really should be > familiar with this list, and oftentimes needs to override it. > > Meh, now that I'm actually typing it out though, perhaps I'm incorrect here, > assuming this class you guys are describing will be well > advertised/documented maybe it will actually make it easier for end > developers to twiddle around with this list, or at least certainly make them > more aware that its even something that they have the ability to actually > change. > > Matt > > Michael McCandless wrote: >> >> What is the "lucene defaults class"? 
>> >> Mike >> >> On Thu, May 21, 2009 at 12:37 PM, Matthew Hall >> wrote: >> >>> >>> For extreme examples like this, couldn't the stopword list be >>> encapsulated >>> into a single class that's used by the lucene defaults class. >>> >>> That way if you folks released updates to mostly static content like a >>> stopword list, new or old users could get it easily with a simple drop in >>> fix? >>> >>> Just my two cents. >>> >>> Matt >>> >>> Michael McCandless wrote: >>> On Thu, May 21, 2009 at 12:19 PM, Robert Muir wrote: > > even as simple as changing default stopword list for some analyzer > could > be > an issue, if the user doesn't re-index in response to that change. > > OK, right. So say we forgot to include "the" in the default English stopwords list (yes, an extreme example...). Under the proposed changes 1 & 2 to back-compat policy, we would add "the" to the default stopword list, so new users get the fix, but still keep the the-less list accessible (deprecated). We'd add an entry in CHANGES.txt saying this happened, and then show code on how to get back to the the-less stopword list. New users using that StopFilter would properly see "the" filtered out. Users who upgraded would need to fix their code to switch back to the deprecated the-less list. 
Mike
Re: Lucene's default settings & back compatibility
Yeah, I was thinking of the more likely case where something like "teh" is in the list...

On Thu, May 21, 2009 at 12:25 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> On Thu, May 21, 2009 at 12:19 PM, Robert Muir wrote:
> > even as simple as changing default stopword list for some analyzer could be an issue, if the user doesn't re-index in response to that change.
>
> OK, right.
>
> So say we forgot to include "the" in the default English stopwords list (yes, an extreme example...).
>
> Under the proposed changes 1 & 2 to back-compat policy, we would add "the" to the default stopword list, so new users get the fix, but still keep the the-less list accessible (deprecated). We'd add an entry in CHANGES.txt saying this happened, and then show code on how to get back to the the-less stopword list.
>
> New users using that StopFilter would properly see "the" filtered out. Users who upgraded would need to fix their code to switch back to the deprecated the-less list.
>
> Mike

-- Robert Muir rcm...@gmail.com
Re: Lucene's default settings & back compatibility
On Thu, May 21, 2009 at 12:46 PM, DM Smith wrote:
> I'm looking forward to the repackaging effort.

I'm looking forward to it too! I can't wait for NumericRangeQuery...

But: someone with a serious ant skill set, and some time, needs to get the itch here and start iterating...

Mike
Re: Lucene's default settings & back compatibility
On Thu, May 21, 2009 at 12:43 PM, Mark Miller wrote:
> Hmmm - that's starting to sound nastier. It's another barrier to upgrading to a new jar. I have to monitor/hunt down and not miss all these little flags so that docs/terms don't disappear from my index? There is already some of that and I'd hate to see it grow. I'd rather have a stronger back compat policy here, I think. It's just one thing after another stacking up to make upgrading a risk/pain at each jar update. I used to work on a project where I upgraded Lucene often, and it was because it was so easy to keep dropping in and picking new features as I wanted. We will really start pushing a heavy onus onto our users if we fully adopt 1 and 2. New users will benefit, but old users, unless they are Lucene hackers like you guys, will suffer. Eventually our new users will be our old users.

Well this is what I love about the actsAsVersion solution. There's no pain for our back-compat users (besides the one-time effort to set actsAsVersion), and new users always get the best settings.

Or... we could consider encoding "actsAsVersion" into the index by default. Then, when IndexWriter asks the Analyzer for a tokenStream, it'd pass in the actsAsVersion, so that any tokenizers/filters in the chain would preserve their behavior as of that Lucene version. (You'd have to be able to turn this off, too.)

Mike
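To make the single-static-setting idea concrete, here is a rough sketch of what it could look like. Only the name actsAsVersion comes from the thread; `CompatSettings`, `SketchStopWords`, the integer version encoding (24 for 2.4, 29 for 2.9) and the choice of 2.9 as the release that adds "the" are all assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical holder for the process-wide compatibility setting. */
final class CompatSettings {
    /** Defaults to the newest behavior, so new users get the best settings. */
    private static volatile int actsAsVersion = 29;

    private CompatSettings() {}

    /** One-time call for upgrading users who want older defaults, e.g. 24 for 2.4. */
    static void setActsAsVersion(int version) {
        actsAsVersion = version;
    }

    static int getActsAsVersion() {
        return actsAsVersion;
    }
}

/** Hypothetical provider of a default stopword list that depends on the setting. */
final class SketchStopWords {
    private SketchStopWords() {}

    static Set<String> defaultEnglish() {
        Set<String> words = new HashSet<>(Arrays.asList("a", "an", "and", "of"));
        // Assumption: the forgotten "the" is added as of version 2.9.
        if (CompatSettings.getActsAsVersion() >= 29) {
            words.add("the");
        }
        return words;
    }
}
```

The sketch also shows the concern raised elsewhere in the thread: any code in the same JVM that calls `CompatSettings.setActsAsVersion` changes what `SketchStopWords.defaultEnglish()` returns for everyone, with nothing at the call site hinting at why.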
Re: Lucene's default settings & back compatibility
Sorry, I wasn't quite sure what to call this new class you guys have been talking about. I was referring to the class that's being discussed to encapsulate all of the defaults for a given lucene release. (Its caching strategies etc etc) I'm just not certain that something like a static list of words belongs in a higher level defaults class like you guys are talking about, especially considering that anyone using a stop enabled analyzer really should be familiar with this list, and oftentimes needs to override it. Meh, now that I'm actually typing it out though, perhaps I'm incorrect here, assuming this class you guys are describing will be well advertised/documented maybe it will actually make it easier for end developers to twiddle around with this list, or at least certainly make them more aware that its even something that they have the ability to actually change. Matt Michael McCandless wrote: What is the "lucene defaults class"? Mike On Thu, May 21, 2009 at 12:37 PM, Matthew Hall wrote: For extreme examples like this, couldn't the stopword list be encapsulated into a single class that's used by the lucene defaults class. That way if you folks released updates to mostly static content like a stopword list, new or old users could get it easily with a simple drop in fix? Just my two cents. Matt Michael McCandless wrote: On Thu, May 21, 2009 at 12:19 PM, Robert Muir wrote: even as simple as changing default stopword list for some analyzer could be an issue, if the user doesn't re-index in response to that change. OK, right. So say we forgot to include "the" in the default English stopwords list (yes, an extreme example...). Under the proposed changes 1 & 2 to back-compat policy, we would add "the" to the default stopword list, so new users get the fix, but still keep the the-less list accessible (deprecated). We'd add an entry in CHANGES.txt saying this happened, and then show code on how to get back to the the-less stopword list. 
New users using that StopFilter would properly see "the" filtered out. Users who upgraded would need to fix their code to switch back to the deprecated the-less list. Mike -- Matthew Hall Software Engineer Mouse Genome Informatics mh...@informatics.jax.org (207) 288-6012
Re: Lucene's default settings & back compatibility
Michael McCandless wrote: On Thu, May 21, 2009 at 8:24 AM, DM Smith wrote: On May 21, 2009, at 7:17 AM, Michael McCandless wrote: 1) Default settings can change; we will always choose defaults based on "latest & greatest for new users". This only affects "runtime behavior". EG in 2.9, when sorting by field you won't get scores by default. When we do this we should clearly document the change, and what settings one could use to get back to the old behavior, in CHANGES.txt. I'd reverse 1 and 2 and note in 1 that the old behavior might be deprecated. OK. 2) An API, once released as deprecated, is fair game to be removed in the next minor release. I presume you mean that it will be present for at least one full minor release. So, if at 3.1.5 a deprecation is introduced, then it won't be removed until 3.3 at the earliest, because 3.2 was the first minor release in which it appeared at the start. I don't think it is fair to expect users to get every last point release. Right. We still only make bug fixes on point releases, support the index file format until the next major release -- those don't change. Is it just the index file format? I would hope that the behavior of filters, analyzers and such would not change so as to invalidate an index. Can you give an example of such changes? EG if we fix a bug in StandardAnalyzer, we will default it to fixed for new users and expect you on upgrading to read CHANGES.txt and change your app to set that setting to its non-defaulted value. I guess I'm not too concerned with bug fixes. I'm kind of a nut when it comes to correctness. But, I'd want to know that such a bug broke strict backward compatibility. I guess I don't want backward compatibility to get too much in the way of fixing bugs. (I think sometimes it has.) I wouldn't expect a compatibility flag to preserve buggy behavior. I guess I'm willing to go to extra effort to work with bug fixes. But I wouldn't expect others to feel the same way. 
Off the top of my head, in addition to Robert's stop word list, let's say that the filter that strips accents (I can't remember the name) is changed to do more than Latin-1 to ASCII folding. That would invalidate existing indexes. Or a new and improved filter is created to replace a class I use and the old class is deprecated. If that old class goes away, my index is invalidated. So if the stream of tokens out of an analyzer changes or the results of a filter are different, an index built with them is invalidated. If the output remains the same, I shouldn't care what has changed internally and probably don't care if the API has changed. I don't know if it matters to this discussion, but there's a lot in contrib that people (of which I am one :) expect to be stable. I'm looking forward to the repackaging effort. -- DM
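The "at least one full minor release" reading discussed in this message (deprecated in 3.1.5, removable in 3.3 at the earliest) is simple arithmetic, sketched below. `DeprecationPolicy` is a hypothetical helper, and it deliberately ignores major-version boundaries, which the thread does not spell out.

```java
/** Hypothetical helper encoding the "one full minor release" reading of the policy. */
final class DeprecationPolicy {
    private DeprecationPolicy() {}

    /**
     * Returns {major, minor} of the earliest release that may remove an API
     * deprecated in major.minor.patch. A deprecation introduced in a point
     * release (patch > 0) only counts as present "from the start" in the
     * following minor release.
     */
    static int[] earliestRemoval(int major, int minor, int patch) {
        int firstFullMinor = (patch == 0) ? minor : minor + 1;
        return new int[] {major, firstFullMinor + 1};
    }
}
```

For example, `earliestRemoval(3, 1, 5)` and `earliestRemoval(3, 2, 0)` both give 3.3, matching the example in the message.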
Re: Lucene's default settings & back compatibility
Michael McCandless wrote:
> On Thu, May 21, 2009 at 12:19 PM, Robert Muir wrote:
>> even as simple as changing default stopword list for some analyzer could be an issue, if the user doesn't re-index in response to that change.
>
> OK, right.
>
> So say we forgot to include "the" in the default English stopwords list (yes, an extreme example...).
>
> Under the proposed changes 1 & 2 to back-compat policy, we would add "the" to the default stopword list, so new users get the fix, but still keep the the-less list accessible (deprecated). We'd add an entry in CHANGES.txt saying this happened, and then show code on how to get back to the the-less stopword list.
>
> New users using that StopFilter would properly see "the" filtered out. Users who upgraded would need to fix their code to switch back to the deprecated the-less list.
>
> Mike
>
> - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org

Hmmm - that's starting to sound nastier. It's another barrier to upgrading to a new jar. I have to monitor/hunt down and not miss all these little flags so that docs/terms don't disappear from my index? There is already some of that and I'd hate to see it grow. I'd rather have a stronger back compat policy here I think. It's just one thing after another stacking up to make upgrading a risk/pain at each jar update.

I used to work on a project where I upgraded Lucene often, and it was because it was so easy to keep dropping in and picking new features as I wanted. We will really start pushing a heavy onus onto our users if we fully adopt 1 and 2. New users will benefit, but old users, unless they are Lucene hackers like you guys, will suffer. Eventually our new users will be our old users.

I'm fully on the fence. I think relaxing will help development, but Lucene's stability has also been a strong quality. It would be nice to see it remain in some form.

--
- Mark
http://www.lucidimagination.com

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene's default settings & back compatibility
What is the "lucene defaults class"? Mike On Thu, May 21, 2009 at 12:37 PM, Matthew Hall wrote: > For extreme examples like this, couldn't the stopword list be encapsulated > into a single class that's used by the lucene defaults class. > > That way if you folks released updates to mostly static content like a > stopword list, new or old users could get it easily with a simple drop in > fix? > > Just my two cents. > > Matt > > Michael McCandless wrote: >> >> On Thu, May 21, 2009 at 12:19 PM, Robert Muir wrote: >> >>> >>> even as simple as changing default stopword list for some analyzer could >>> be >>> an issue, if the user doesn't re-index in response to that change. >>> >> >> OK, right. >> >> So say we forgot to include "the" in the default English stopwords >> list (yes, an extreme example...). >> >> Under the proposed changes 1 & 2 to back-compat policy, we would add >> "the" to the default stopword list, so new users get the fix, but >> still keep the the-less list accessible (deprecated). We'd add an >> entry in CHANGES.txt saying this happened, and then show code on how >> to get back to the the-less stopword list. >> >> New users using that StopFilter would properly see "the" filtered out. >> Users who upgraded would need to fix their code to switch back to the >> deprecated the-less list. >> >> Mike >> >> - >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-dev-h...@lucene.apache.org >> >> > > > -- > Matthew Hall > Software Engineer > Mouse Genome Informatics > mh...@informatics.jax.org > (207) 288-6012 > > > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > > - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711682#action_12711682 ]

Michael McCandless commented on LUCENE-1614:

bq. If they are all on -1 to start with, they are already all sorted.

Right but that defeats the optimization. I'm talking about this code in ConjunctionScorer:

{code}
Arrays.sort(scorers, new Comparator() { // sort the array
  public int compare(Object o1, Object o2) {
    return ((Scorer)o1).doc() - ((Scorer)o2).doc();
  }
});

doNext();

// If first-time skip distance is any predictor of
// scorer sparseness, then we should always try to skip first on
// those scorers.
// Keep last scorer in it's last place (it will be the first
// to be skipped on), but reverse all of the others so that
// they will be skipped on in order of original high skip.
int end=(scorers.length-1);
for (int i=0; i<(end>>1); i++) {
  Scorer tmp = scorers[i];
  scorers[i] = scorers[end-i-1];
  scorers[end-i-1] = tmp;
}
{code}

Ie it sets things up so that "typically" the rarest sub-scorer drives the intersection. If they are all on -1 then this heuristic won't work.

{quote}
We could do some smart sorting in the constructor so that we skip in cheap and fast scorers first (TermScorers first, ordered by df, followed by simple conjunctions of terms, followed by other more expensive stuff like sloppy phrase queries and complex boolean queries. Perhaps in the future, even a method on Scorer that estimates it's cost?
{quote}

Right, we'd need to do something along these lines if we switch DISI to start with doc() = -1.
> Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
>
>                 Key: LUCENE-1614
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1614
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Shai Erera
>             Fix For: 2.9
>
>         Attachments: LUCENE-1614.patch, LUCENE-1614.patch
>
> See http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html for the full discussion. The basic idea is to add variants to those two methods that return the current doc they are at, to save successive calls to doc(). If there are no more docs, return -1. A summary of what was discussed so far:
> # Deprecate those two methods.
> # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI (calls next() and skipTo() respectively, and will be changed to abstract in 3.0).
> #* I actually would like to propose an alternative to the names: advance() and advance(int) - the first advances by one, the second advances to target.
> # Wherever these are used, do something like '(doc = advance()) >= 0' instead of comparing to -1 for improved performance.
> I will post a patch shortly

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
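The proposal in the LUCENE-1614 description (a doc-returning variant with a default implementation that delegates to the deprecated boolean method) can be sketched roughly as follows. This is only an illustration in modern Java: SketchDocIdIterator, ArrayDocIdIterator and NextDocDemo are hypothetical stand-ins, not the real DocIdSetIterator or anything from the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for DocIdSetIterator; not the real Lucene class.
abstract class SketchDocIdIterator {
    /** Old-style API: moves to the next doc, returns false when exhausted. */
    public abstract boolean next();

    /** Old-style API: the doc the iterator is currently positioned on. */
    public abstract int doc();

    /**
     * New-style variant: advances and returns the current doc, or -1 when
     * there are no more docs. The default delegates to the old methods, so
     * existing subclasses keep compiling without changes.
     */
    public int nextDoc() {
        return next() ? doc() : -1;
    }
}

// Hypothetical iterator over a sorted array of doc ids, for demonstration only.
final class ArrayDocIdIterator extends SketchDocIdIterator {
    private final int[] docs;
    private int pos = -1;

    ArrayDocIdIterator(int... docs) {
        this.docs = docs;
    }

    @Override
    public boolean next() {
        return ++pos < docs.length;
    }

    @Override
    public int doc() {
        return docs[pos];
    }
}

final class NextDocDemo {
    /** Drains an iterator using the '(doc = ...) >= 0' idiom from the issue. */
    static List<Integer> collect(SketchDocIdIterator it) {
        List<Integer> out = new ArrayList<>();
        int doc;
        while ((doc = it.nextDoc()) >= 0) {
            out.add(doc);
        }
        return out;
    }
}
```

The loop in collect() makes one call per step instead of a next()/doc() pair, which is the saving the issue is after.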
Re: Lucene's default settings & back compatibility
For extreme examples like this, couldn't the stopword list be encapsulated into a single class that's used by the lucene defaults class. That way if you folks released updates to mostly static content like a stopword list, new or old users could get it easily with a simple drop in fix? Just my two cents. Matt Michael McCandless wrote: On Thu, May 21, 2009 at 12:19 PM, Robert Muir wrote: even as simple as changing default stopword list for some analyzer could be an issue, if the user doesn't re-index in response to that change. OK, right. So say we forgot to include "the" in the default English stopwords list (yes, an extreme example...). Under the proposed changes 1 & 2 to back-compat policy, we would add "the" to the default stopword list, so new users get the fix, but still keep the the-less list accessible (deprecated). We'd add an entry in CHANGES.txt saying this happened, and then show code on how to get back to the the-less stopword list. New users using that StopFilter would properly see "the" filtered out. Users who upgraded would need to fix their code to switch back to the deprecated the-less list. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- Matthew Hall Software Engineer Mouse Genome Informatics mh...@informatics.jax.org (207) 288-6012 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker
[ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711678#action_12711678 ] Mark Miller commented on LUCENE-1595: - Right - the back compat for each contrib is completely up to that contrib. In the past though, anything thats 1.4 has stayed 1.4 without good reason so that users are not jolted (probably more out there using java 1.4 than you might think). On 3.0, when core goes 1.5, it will make sense to allow 1.5 in all the contribs that are 1.4 now. > Split DocMaker into ContentSource and DocMaker > -- > > Key: LUCENE-1595 > URL: https://issues.apache.org/jira/browse/LUCENE-1595 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Reporter: Shai Erera >Assignee: Mark Miller > Fix For: 2.9 > > > This issue proposes some refactoring to the benchmark package. Today, > DocMaker has two roles: collecting documents from a collection and preparing > a Document object. These two should actually be split up to ContentSource and > DocMaker, which will use a ContentSource instance. > ContentSource will implement all the methods of DocMaker, like > getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ > 1591, by having a basic ContentSource that offers input stream services, and > wraps a file (for example) with a bzip or gzip streams etc. > DocMaker will implement the makeDocument methods, reusing DocState etc. > The idea is that collecting the Enwiki documents, for example, should be the > same whether I create documents using DocState, add payloads or index > additional metadata. Same goes for Trec and Reuters collections, as well as > LineDocMaker. > In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are > 99% the same and 99% different. Most of their differences lie in the way they > read the data, while most of the similarity lies in the way they create > documents (using DocState). 
> That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker > (just the reuse of DocState). Also, other DocMakers do not use that DocState > today, something they could have gotten for free with this refactoring > proposed. > So by having an EnwikiContentSource, ReutersContentSource and others (TREC, > Line, Simple), I can write several DocMakers, such as DocStateMaker, > ConfigurableDocMaker (one which accepts all kinds of config options) and > custom DocMakers (payload, facets, sorting), passing to them a ContentSource > instance and reuse the same DocMaking algorithm with many content sources, as > well as the same ContentSource algorithm with many DocMaker implementations. > This will also give us the opportunity to perf test content sources alone > (i.e., compare bzip, gzip and regular input streams), w/o the overhead of > creating a Document object. > I've already done so in my code environment (I extend the benchmark package > for my application's purposes) and I like the flexibility I have. I think > this can be a nice contribution to the benchmark package, which can result in > some code cleanup as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
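The split described in the LUCENE-1595 description (one piece that reads raw content, another that builds documents from it) might look something like the sketch below. All names here (SketchContentSource, ListContentSource, SketchDocMaker) and the tab-separated record format are assumptions made for illustration; the real benchmark classes and their method signatures are not shown in this thread.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical: reads raw records from a collection and knows nothing about documents.
interface SketchContentSource {
    /** Returns the next raw record, or null when the source is exhausted. */
    String nextRecord();
}

// Hypothetical line-based source backed by an in-memory list.
final class ListContentSource implements SketchContentSource {
    private final Iterator<String> lines;

    ListContentSource(List<String> lines) {
        this.lines = lines.iterator();
    }

    @Override
    public String nextRecord() {
        return lines.hasNext() ? lines.next() : null;
    }
}

// Hypothetical doc maker: turns records from any content source into field maps.
final class SketchDocMaker {
    private final SketchContentSource source;

    SketchDocMaker(SketchContentSource source) {
        this.source = source;
    }

    /** Builds the next "document" (a plain field map here), or null at end of input. */
    Map<String, String> makeDocument() {
        String record = source.nextRecord();
        if (record == null) {
            return null;
        }
        // Assumed record format for this sketch: "title<TAB>body".
        String[] parts = record.split("\t", 2);
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", parts[0]);
        doc.put("body", parts.length > 1 ? parts[1] : "");
        return doc;
    }
}
```

Because the doc maker only sees the interface, the same document-building code can be reused over any source, and a source can be timed on its own by calling nextRecord() in a loop without building documents.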
Re: Lucene's default settings & back compatibility
On Thu, May 21, 2009 at 12:19 PM, Robert Muir wrote: > even as simple as changing default stopword list for some analyzer could be > an issue, if the user doesn't re-index in response to that change. OK, right. So say we forgot to include "the" in the default English stopwords list (yes, an extreme example...). Under the proposed changes 1 & 2 to back-compat policy, we would add "the" to the default stopword list, so new users get the fix, but still keep the the-less list accessible (deprecated). We'd add an entry in CHANGES.txt saying this happened, and then show code on how to get back to the the-less stopword list. New users using that StopFilter would properly see "the" filtered out. Users who upgraded would need to fix their code to switch back to the deprecated the-less list. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
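To make the "the-less list" scenario concrete, here is a rough sketch of what a changed default plus a deprecated escape hatch could look like. SketchStopWords, its two word sets and the filter() helper are hypothetical and only mirror the shape of the idea; they are not Lucene's StopFilter or StopAnalyzer API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical holder for stop word lists; not a real Lucene class.
final class SketchStopWords {
    /** Old default that forgot "the"; kept so upgraders can opt back in. */
    @Deprecated
    static final Set<String> LEGACY_ENGLISH = unmodifiable("a", "an", "and", "of");

    /** Current default: new users get the fixed list. */
    static final Set<String> ENGLISH = unmodifiable("a", "an", "and", "of", "the");

    private SketchStopWords() {
    }

    private static Set<String> unmodifiable(String... words) {
        return Collections.unmodifiableSet(new HashSet<>(Arrays.asList(words)));
    }

    /** Removes stop words using an explicitly chosen list. */
    static List<String> filter(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    /** Removes stop words using the current default list. */
    static List<String> filter(List<String> tokens) {
        return filter(tokens, ENGLISH);
    }
}
```

Under this sketch, an upgrading application that must stay consistent with an existing index would pass LEGACY_ENGLISH explicitly, which is the kind of one-line change a CHANGES.txt entry would need to spell out.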
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711676#action_12711676 ] Yonik Seeley commented on LUCENE-1614: -- bq. But: wouldn't ConjunctionScorer still need an init() to sort its sub-scorers? If they are all on -1 to start with, they are already all sorted. We could do some smart sorting in the constructor so that we skip in cheap and fast scorers first (TermScorers first, ordered by df, followed by simple conjunctions of terms, followed by other more expensive stuff like sloppy phrase queries and complex boolean queries. Perhaps in the future, even a method on Scorer that estimates it's cost?
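The idea of ordering sub-scorers up front by an estimated cost, rather than by the doc they happen to be positioned on, could be sketched like this. SketchSubScorer and its estimatedCost field are invented for the example (for a term scorer the cost might be its document frequency); no such method exists on Scorer in this thread, so treat it as an assumption.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sub-scorer carrying a cost estimate; not Lucene's Scorer.
final class SketchSubScorer {
    final String name;
    final long estimatedCost; // assumed measure, e.g. docFreq for a term scorer

    SketchSubScorer(String name, long estimatedCost) {
        this.name = name;
        this.estimatedCost = estimatedCost;
    }
}

final class CostOrdering {
    private CostOrdering() {
    }

    /**
     * Returns a copy sorted cheapest-first, so the sparsest sub-scorer can
     * drive the intersection even when every scorer still reports doc() = -1.
     */
    static SketchSubScorer[] byCost(SketchSubScorer[] scorers) {
        SketchSubScorer[] sorted = scorers.clone();
        Arrays.sort(sorted, Comparator.comparingLong((SketchSubScorer s) -> s.estimatedCost));
        return sorted;
    }
}
```

Unlike the first-skip-distance heuristic, this ordering needs no initial next() call on each scorer, which is why it would still work with iterators that start unpositioned.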
[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker
[ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711674#action_12711674 ] Michael McCandless commented on LUCENE-1595: Probably it's best to stick w/ 1.4. Someday, I hope, we will get to 3.0 :)
Re: Lucene's default settings & back compatibility
even as simple as changing default stopword list for some analyzer could be an issue, if the user doesn't re-index in response to that change. > Can you give an example of such changes? EG if we fix a bug in > StandardAnalyzer, we will default it to fixed for new users and expect > you on upgrading to read CHANGES.txt and change your app to set that > setting to its non-defaulted value. > > Mike > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > > -- Robert Muir rcm...@gmail.com
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711670#action_12711670 ] Michael McCandless commented on LUCENE-1614: We could also consider adding DISI.start (we discussed this under another issue). And maybe likewise DISI.finish -- there's a question on the user's list now "Do TermDocs and TermEnum need to be closed?" that notes that DISI never gives one a chance to close.
Re: Lucene's default settings & back compatibility
On Thu, May 21, 2009 at 8:24 AM, DM Smith wrote: > > On May 21, 2009, at 7:17 AM, Michael McCandless wrote: > >> 1) Default settings can change; we will always choose defaults based >> on "latest & greatest for new users". This only affects "runtime >> behavior". EG in 2.9, when sorting by field you won't get scores >> by default. When we do this we should clearly document the >> change, and what settings one could use to get back to the old >> behavior, in CHANGES.txt. > > I'd reverse 1 and 2 and note in 1 that the old behavior might be deprecated. OK. >> 2) An API, once released as deprecated, is fair game to be removed >> in the next minor release. > > I presume you mean that it will be present for at least one full minor > release. So, if at 3.1.5 a deprecation is introduced, then it won't be > removed until 3.3 at the earliest, because 3.2 was the first minor release > in which it appeared at the start. I don't think it is fair to expect users > to get every last point release. Right. >> We still only make bug fixes on point releases, support the index file >> format until the next major release -- those don't change. > > Is it just the index file format? I would hope that the behavior of filters, > analyzers and such would not change so as to invalidate an index. Can you give an example of such changes? EG if we fix a bug in StandardAnalyzer, we will default it to fixed for new users and expect you on upgrading to read CHANGES.txt and change your app to set that setting to its non-defaulted value. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
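The "at least one full minor release" reading of rule 2 can be worked through in code. The sketch below encodes one interpretation of the exchange above: a deprecation first shipped in a point release X.Y.Z (Z > 0) must survive all of X.(Y+1), while one shipped in X.Y.0 has already been present for a full minor line. SketchVersion is a hypothetical helper, and it ignores major-version boundaries.

```java
// Hypothetical helper illustrating one reading of the deprecation rule; not a Lucene class.
final class SketchVersion {
    final int major;
    final int minor;
    final int point;

    SketchVersion(int major, int minor, int point) {
        this.major = major;
        this.minor = minor;
        this.point = point;
    }

    /**
     * Earliest release that may remove an API first deprecated in this release.
     * Deprecated in X.Y.0: present for the whole X.Y line, so X.(Y+1) may drop it.
     * Deprecated in X.Y.Z with Z > 0: X.(Y+1) is the first full minor release
     * carrying the deprecation, so removal waits until X.(Y+2).
     */
    SketchVersion earliestRemoval() {
        int minorsToSkip = (point == 0) ? 1 : 2;
        return new SketchVersion(major, minor + minorsToSkip, 0);
    }

    @Override
    public String toString() {
        return major + "." + minor + "." + point;
    }
}
```

With DM Smith's example, new SketchVersion(3, 1, 5).earliestRemoval() gives 3.3.0, matching "it won't be removed until 3.3 at the earliest".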
[jira] Updated: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.
[ https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1651: --- Fix Version/s: 2.9 > Make IndexReader.open() always return MSR to simplify (re-)opens. > - > > Key: LUCENE-1651 > URL: https://issues.apache.org/jira/browse/LUCENE-1651 > Project: Lucene - Java > Issue Type: Task >Affects Versions: 2.9 >Reporter: Earwin Burrfoot >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1651.patch > > > As per discussion in mailing list, I'm making DirectoryIndexReader.open() > always return MSR, even for single-segment indexes. > While theoretically valid in the past (if you make sure to keep your index > constantly optimized) this feature is made practically obsolete by > per-segment collection. > The patch somewhat de-hairies (re-)open logic for MSR/SR. > SR no longer needs an ability to pose as toplevel directory-owning IR. > All related logic is moved from DIR to MSR. > DIR becomes almost empty, and copying two or three remaining fields over to > MSR/SR, I remove it. > Lots of tests fail, as they rely on SR returned from IR.open(), I fix by > introducing SR.getOnlySegmentReader static package-private method. > Some previous bugs are uncovered, one is fixed in LUCENE-1645, another > (partially fixed in LUCENE-1648) is fixed in this patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.
[ https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711668#action_12711668 ] Michael McCandless commented on LUCENE-1651: Excellent! Thanks Earwin. bq. test-tag fails due to no-longer-correct test assumptions. Should I checkout the branch and make a separate patch, or how is it done? Yes, please make mods to the tags/XXX checkout, and include those mods in the patch.
[jira] Assigned: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.
[ https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-1651: -- Assignee: Michael McCandless
Re: SegmentReader instantiation
Michael McCandless wrote:
> On Thu, May 21, 2009 at 10:53 AM, Earwin Burrfoot wrote:
>>> I agree we should probably remove it, unless there are users relying on this. Maintaining side-by-side sources is difficult with time.
>>
>> As I said in the initial message, this feature introduces no runtime behaviour changes, so you can't really 'rely' on it and break if it's removed.
>
> Well maybe someone loves the performance improvement... and took it further by making their own native code extensions. I'm not sure how much these gains are. But people can get quite crazy when it comes to performance :)
>
>>> Can you send an email to java-user to take a quick survey on whether anyone is somehow needing this?
>>
>> Never subscribed there. Too low signal-to-noise ratio. I can, but .. is it a must? :)
>
> In fact I find many good ideas for improving Lucene come from our users, and one can't really understand what's important in Lucene without being grounded on how it's used. "Development" and "using" go hand in hand. The discussions that take place there spawn still more ideas, and following those discussions causes me to think harder about the areas being discussed, so I learn more myself about Lucene and find more things to improve and ponder. Not to mention when there's a sneaky bug, it usually appears on the users list first. I jump on those ;) So, yeah, I think it is a must.
>
> It's likely nobody will respond after a few days, then we should remove gcj. I'll go ask if anyone is relying on gcj native code on java-user.

Fedora uses Lucene for Eclipse and uses gcj for Eclipse. It might be used elsewhere. Don't know if that means they need the gcj stuff in Lucene. I just wish they'd rework to use openjdk.

-- DM

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.
[ https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-1651: Attachment: LUCENE-1651.patch Okay, here's the first patch, against latest trunk. test-core, test-contrib pass. test-tag fails due to no-longer-correct test assumptions. Should I checkout the branch and make a separate patch, or how is it done?
Re: SegmentReader instantiation
On Thu, May 21, 2009 at 10:53 AM, Earwin Burrfoot wrote:
>> I agree we should probably remove it, unless there are users relying
>> on this. Maintaining side-by-side sources is difficult with time.
>
> As I said in the initial message, this feature introduces no runtime
> behaviour changes, so you can't really 'rely' on it and break if it's
> removed.

Well maybe someone loves the performance improvement... and took it further by making their own native code extensions. I'm not sure how large these gains are. But people can get quite crazy when it comes to performance :)

>> Can you send an email to java-user to take a quick survey on whether
>> anyone is somehow needing this?
> Never subscribed there. Too low signal-to-noise ratio. I can, but ..
> is it a must? :)

In fact I find many good ideas for improving Lucene come from our users, and one can't really understand what's important in Lucene without being grounded in how it's used. "Development" and "using" go hand in hand.

The discussions that take place there spawn still more ideas, and following those discussions causes me to think harder about the areas being discussed, so I learn more myself about Lucene and find more things to improve and ponder.

Not to mention when there's a sneaky bug, it usually appears on the users list first. I jump on those ;)

So, yeah, I think it is a must.

It's likely nobody will respond; if that's still the case after a few days, we should remove gcj.

I'll go ask on java-user if anyone is relying on gcj native code.

Mike
[jira] Created: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.
Make IndexReader.open() always return MSR to simplify (re-)opens. - Key: LUCENE-1651 URL: https://issues.apache.org/jira/browse/LUCENE-1651 Project: Lucene - Java Issue Type: Task Affects Versions: 2.9 Reporter: Earwin Burrfoot As per discussion in mailing list, I'm making DirectoryIndexReader.open() always return MSR, even for single-segment indexes. While theoretically valid in the past (if you make sure to keep your index constantly optimized) this feature is made practically obsolete by per-segment collection. The patch somewhat de-hairies (re-)open logic for MSR/SR. SR no longer needs an ability to pose as toplevel directory-owning IR. All related logic is moved from DIR to MSR. DIR becomes almost empty, and copying two or three remaining fields over to MSR/SR, I remove it. Lots of tests fail, as they rely on SR returned from IR.open(), I fix by introducing SR.getOnlySegmentReader static package-private method. Some previous bugs are uncovered, one is fixed in LUCENE-1645, another (partially fixed in LUCENE-1648) is fixed in this patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711653#action_12711653 ] Michael McCandless commented on LUCENE-1614: {quote} On the other end of the scale... getting rid of "if (firstTime)" is another check I've long wanted to eliminate. if doc() produced -1 the first time, before any calls to next() or skipTo(), we could get rid of the if (firstTime) code in ConjunctionScorer and others I think. The question is, would this be a burden to any scorers or DISI implementations? {quote} +1 Since we're changing DISI's semantics, now seems like a great time to make this change to. Eliminating the "if (firstTime)" from next() would be great. But: wouldn't ConjunctionScorer still need an init() to sort its sub-scorers? (Though, really, we ought to do that sort based on more accurate criteria, eg add a DIS.approxCount() (the first docID of each sub-scorer is an approximation that could easily be very wrong). If we had that, then in the ConjunctionScorer's ctor we would do the ordering). > Add next() and skipTo() variants to DocIdSetIterator that return the current > doc, instead of boolean > > > Key: LUCENE-1614 > URL: https://issues.apache.org/jira/browse/LUCENE-1614 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera > Fix For: 2.9 > > Attachments: LUCENE-1614.patch, LUCENE-1614.patch > > > See > http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html > for the full discussion. The basic idea is to add variants to those two > methods that return the current doc they are at, to save successive calls to > doc(). If there are no more docs, return -1. A summary of what was discussed > so far: > # Deprecate those two methods. > # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI > (calls next() and skipTo() respectively, and will be changed to abstract in > 3.0). 
> #* I actually would like to propose an alternative to the names: advance() > and advance(int) - the first advances by one, the second advances to target. > # Wherever these are used, do something like '(doc = advance()) >= 0' instead > of comparing to -1 for improved performance. > I will post a patch shortly
Re: SegmentReader instantiation
2009/5/21 Michael McCandless : > It looks like this was done in order to implement > SegmentTermDocs.read(int[], int[]) natively, when using a gcj > environment, since that gave performance improvements? Yup, you're right. But something tells me, since Lucene 1.9 many things changed and this is no longer relevant. > I agree we should probably remove it, unless there are users relying > on this. Maintaining side-by-side sources is difficult with time. As I said in the initial message, this feature introduces no runtime behaviour changes, so you can't really 'rely' on it and break if it's removed. > Can you send an email to java-user to take a quick survey on whether > anyone is somehow needing this? Never subscribed there. Too low signal-to-noise ratio. I can, but .. is it a must? :) -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: SegmentReader instantiation
It looks like this was done in order to implement SegmentTermDocs.read(int[], int[]) natively, when using a gcj environment, since that gave performance improvements? I agree we should probably remove it, unless there are users relying on this. Maintaining side-by-side sources is difficult with time. Also, this seems like something source code specialization (LUCENE-1594) should cover. Can you send an email to java-user to take a quick survey on whether anyone is somehow needing this? Mike On Thu, May 21, 2009 at 10:17 AM, Earwin Burrfoot wrote: > Right now a set of system properties and Class.newInstance() is used > to create SegmentReader. > > I've tracked down this code's origins to: > r150531 | cutting | 2004-09-22 22:32:27 +0400 (ср, 22 сен 2004) | 2 lines > Add GCJ native code for SegmentTermDocs.read(int[],int[]) to > accellerate TermScorer. TermScorer and BooleanScorer are now usually > a bit faster under GCJ than under Sun's JVM. > > Anybody uses this? Let's just go ahead, scrap it, and construct > classes using er.. constructors? :) > No adequate docs for this feature exist, except a very brief mention > in CHANGES.txt. > Even more - when ReadonlySegmentReader was introduced, instantiation > code was blindly copypasted, but no matching GCJReadonlySegmentReader > was created and corresponding Makefile wasn't altered, thus > GCJ-optimized Lucene is currently broken on trunk. > Removal of this feature also won't break any theoretical old apps that > magically happen to use it, because we'll simply ignore the property > with zero change in runtime behaviour. > > -- > Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) > Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 > ICQ: 104465785 > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > > - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
SegmentReader instantiation
Right now a set of system properties and Class.newInstance() is used to create SegmentReader.

I've tracked down this code's origins to:

r150531 | cutting | 2004-09-22 22:32:27 +0400 (Wed, 22 Sep 2004) | 2 lines
Add GCJ native code for SegmentTermDocs.read(int[],int[]) to accellerate TermScorer. TermScorer and BooleanScorer are now usually a bit faster under GCJ than under Sun's JVM.

Does anybody use this? Let's just go ahead, scrap it, and construct classes using er.. constructors? :)

No adequate docs for this feature exist, except a very brief mention in CHANGES.txt. What's more, when ReadonlySegmentReader was introduced, the instantiation code was blindly copy-pasted, but no matching GCJReadonlySegmentReader was created and the corresponding Makefile wasn't altered, so GCJ-optimized Lucene is currently broken on trunk.

Removing this feature also won't break any theoretical old apps that magically happen to use it, because we'll simply ignore the property, with zero change in runtime behaviour.

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
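For readers who want to see the two instantiation styles side by side, here is a minimal sketch. It is not Lucene's code: `PlainReader`, `NativeReader`, `ReaderFactory` and the property name `example.reader.impl` are all hypothetical stand-ins, and the reflective call uses `getDeclaredConstructor().newInstance()` rather than the `Class.newInstance()` mentioned above, since the latter is deprecated in current JDKs.

```java
// Hypothetical stand-ins for illustration only; these are not Lucene classes.
class PlainReader {
    String describe() { return "plain"; }
}

class NativeReader extends PlainReader {
    @Override
    String describe() { return "native"; }
}

class ReaderFactory {
    // Hypothetical property name, invented for this sketch.
    static final String IMPL_PROPERTY = "example.reader.impl";

    // Property-driven style: the implementation class is looked up by name
    // and created reflectively, so the choice is only known at runtime.
    static PlainReader viaProperty() {
        String name = System.getProperty(IMPL_PROPERTY, PlainReader.class.getName());
        try {
            Class<? extends PlainReader> impl =
                Class.forName(name).asSubclass(PlainReader.class);
            return impl.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + name, e);
        }
    }

    // Proposed style: a plain constructor call; the property is simply ignored.
    static PlainReader viaConstructor() {
        return new PlainReader();
    }
}
```

With the constructor style, setting the property has no effect at all, which is the "zero change in runtime behaviour" argument made above, as long as the alternative implementation only differed in speed.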
Re: DateTools performance
Yes, please fix :)

I think there may already be an issue open on the single instance / synchronization / ThreadLocal issue.

Mike

On Thu, May 21, 2009 at 9:52 AM, Shai Erera wrote:
> How much is DateTools in use? I noticed a few potential improvements
> to it, which at least for the benchmark package can improve performance:
>
> 1. timeToString calls Calendar.getInstance on every call. That's a very
> expensive call to make. Why not store the instance as a static member? We
> always call it with the GMT timezone, and internally it reads the default
> Locale, so I don't think it will change while the JVM is up, unless someone
> calls Locale.setDefault() at some point.
>
> If we do this, we will need to make the method synchronized though
> ... but I don't think that's too critical.
>
> 2. dateToString calls timeToString(date.getTime()), which then instantiates
> a new Date(). Kind of wasteful, isn't it?
>
> 3. round(), which is called from timeToString (after creating a Calendar
> instance), creates another (!) Calendar instance ...
>
> I found one usage in QueryParser when it parses range queries and some more
> in the test package.
>
> I don't mind fixing those.
>
> Shai
[jira] Updated: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1614: --- Attachment: LUCENE-1614.patch MAX_VAL as sentinel + the documentation changes + a new entry to CHANGES "back-compat change" on OBSI.nextDoc() and next(int) + tag fixes. All tests pass > Add next() and skipTo() variants to DocIdSetIterator that return the current > doc, instead of boolean > > > Key: LUCENE-1614 > URL: https://issues.apache.org/jira/browse/LUCENE-1614 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera > Fix For: 2.9 > > Attachments: LUCENE-1614.patch, LUCENE-1614.patch > > > See > http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html > for the full discussion. The basic idea is to add variants to those two > methods that return the current doc they are at, to save successive calls to > doc(). If there are no more docs, return -1. A summary of what was discussed > so far: > # Deprecate those two methods. > # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI > (calls next() and skipTo() respectively, and will be changed to abstract in > 3.0). > #* I actually would like to propose an alternative to the names: advance() > and advance(int) - the first advances by one, the second advances to target. > # Wherever these are used, do something like '(doc = advance()) >= 0' instead > of comparing to -1 for improved performance. > I will post a patch shortly -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
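The shape of the change discussed in this issue can be sketched roughly as follows. This is an illustration under assumptions, not the attached patch: `SketchIterator`, `ArrayIterator` and the constant name `NO_MORE_DOCS` are hypothetical; only the ideas (a `nextDoc()` with a default implementation built on the deprecated `next()`/`doc()` pair, and a MAX_VAL-style sentinel for exhaustion) come from the thread.

```java
// Hypothetical sketch; not the DocIdSetIterator from the patch.
abstract class SketchIterator {
    // Sentinel returned once the iterator is exhausted. The name is an
    // assumption; the value follows the "MAX_VAL as sentinel" idea.
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** @deprecated old style: advance first, then ask doc() separately. */
    @Deprecated
    abstract boolean next();

    /** @deprecated old style accessor for the current doc. */
    @Deprecated
    abstract int doc();

    // New style: advance and return the doc in a single call. The default
    // implementation delegates to the old methods, so subclasses written
    // against the old API keep compiling and working.
    int nextDoc() {
        return next() ? doc() : NO_MORE_DOCS;
    }
}

// A subclass that only knows the old API.
class ArrayIterator extends SketchIterator {
    private final int[] docs; // sorted ascending
    private int pos = -1;

    ArrayIterator(int... docs) { this.docs = docs; }

    @Override
    boolean next() {
        if (pos < docs.length) pos++;
        return pos < docs.length;
    }

    @Override
    int doc() { return docs[pos]; }
}

class SketchIteratorDemo {
    // Consumer loop: one call per document, compared against the sentinel.
    static int sum(SketchIterator it) {
        int total = 0;
        int doc;
        while ((doc = it.nextDoc()) != SketchIterator.NO_MORE_DOCS) {
            total += doc;
        }
        return total;
    }
}
```

The default implementation is what keeps old subclasses source compatible; making `nextDoc()` abstract, as the description proposes for 3.0, would force every subclass to implement it directly.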
DateTools performance
How much is DateTools in use? I noticed a few potential improvements to it, which at least for the benchmark package can improve performance:

1. timeToString calls Calendar.getInstance on every call. That's a very expensive call to make. Why not store the instance as a static member? We always call it with the GMT timezone, and internally it reads the default Locale, so I don't think it will change while the JVM is up, unless someone calls Locale.setDefault() at some point.

If we do this, we will need to make the method synchronized though ... but I don't think that's too critical.

2. dateToString calls timeToString(date.getTime()), which then instantiates a new Date(). Kind of wasteful, isn't it?

3. round(), which is called from timeToString (after creating a Calendar instance), creates another (!) Calendar instance ...

I found one usage in QueryParser when it parses range queries and some more in the test package.

I don't mind fixing those.

Shai
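One way the first and third points might be addressed is sketched below. It is only an illustration: `CachedCalendar` and `roundToDay` are hypothetical names rather than DateTools methods, and it swaps the "static member plus synchronized" variant for a per-thread instance, which avoids locking at the cost of one Calendar per thread.

```java
import java.util.Calendar;
import java.util.TimeZone;

// Hypothetical helper, not part of DateTools.
class CachedCalendar {
    private static final TimeZone GMT = TimeZone.getTimeZone("GMT");

    // One Calendar per thread: created on first use, then reused on every
    // call, so there is no per-call Calendar.getInstance() and no locking.
    private static final ThreadLocal<Calendar> CAL =
        ThreadLocal.withInitial(() -> Calendar.getInstance(GMT));

    // Rounds a timestamp (milliseconds since the epoch) down to midnight GMT
    // of the same day, reusing the cached Calendar.
    static long roundToDay(long timeMillis) {
        Calendar cal = CAL.get();
        cal.setTimeInMillis(timeMillis);
        cal.set(Calendar.HOUR_OF_DAY, 0);
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        return cal.getTimeInMillis();
    }
}
```

Because the Calendar is mutable shared state, it has to be either confined to a thread, as here, or guarded by synchronization, as the message suggests; a bare static instance without either would not be thread-safe.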
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711615#action_12711615 ] Shai Erera commented on LUCENE-1614: I plan to open another issue for 3.0 to take advantage of MAX_VAL being returned from doc() also (we cannot rely on doc() returning MAX_VAL today when there are no more docs, hence why we need to wait with these changes until 3.0). You're proposing to add another contract to doc() - to return -1 before nextDoc() and advance(int) were called. I can do that, but we can use this contract only in 3.0. Unless the community decides to change back-compat policy starting with 2.9, which will give us the opportunity to take advantage of "latest & greatest" right away. > Add next() and skipTo() variants to DocIdSetIterator that return the current > doc, instead of boolean > > > Key: LUCENE-1614 > URL: https://issues.apache.org/jira/browse/LUCENE-1614 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera > Fix For: 2.9 > > Attachments: LUCENE-1614.patch > > > See > http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html > for the full discussion. The basic idea is to add variants to those two > methods that return the current doc they are at, to save successive calls to > doc(). If there are no more docs, return -1. A summary of what was discussed > so far: > # Deprecate those two methods. > # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI > (calls next() and skipTo() respectively, and will be changed to abstract in > 3.0). > #* I actually would like to propose an alternative to the names: advance() > and advance(int) - the first advances by one, the second advances to target. > # Wherever these are used, do something like '(doc = advance()) >= 0' instead > of comparing to -1 for improved performance. 
> I will post a patch shortly
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711611#action_12711611 ] Yonik Seeley commented on LUCENE-1614: -- I'm warming to some of the simplifications that a MAX_VAL sentinel can bring. On the other end of the scale... getting rid of "if (firstTime)" is another check I've long wanted to eliminate. if doc() produced -1 the first time, before any calls to next() or skipTo(), we could get rid of the if (firstTime) code in ConjunctionScorer and others I think. The question is, would this be a burden to any scorers or DISI implementations? > Add next() and skipTo() variants to DocIdSetIterator that return the current > doc, instead of boolean > > > Key: LUCENE-1614 > URL: https://issues.apache.org/jira/browse/LUCENE-1614 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera > Fix For: 2.9 > > Attachments: LUCENE-1614.patch > > > See > http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html > for the full discussion. The basic idea is to add variants to those two > methods that return the current doc they are at, to save successive calls to > doc(). If there are no more docs, return -1. A summary of what was discussed > so far: > # Deprecate those two methods. > # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI > (calls next() and skipTo() respectively, and will be changed to abstract in > 3.0). > #* I actually would like to propose an alternative to the names: advance() > and advance(int) - the first advances by one, the second advances to target. > # Wherever these are used, do something like '(doc = advance()) >= 0' instead > of comparing to -1 for improved performance. > I will post a patch shortly -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
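To make the "if (firstTime)" point concrete, here is a rough sketch of a two-way conjunction in which doc() returns -1 before the first advance. It is a toy under assumptions, not ConjunctionScorer: `IntIterator`, `Conjunction` and `NO_MORE_DOCS` are hypothetical names, and the sub-scorer ordering question raised in the later comment is not modelled.

```java
// Hypothetical iterator over a sorted int array; not a Lucene class.
class IntIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docs; // sorted ascending, no duplicates
    private int pos = -1;

    IntIterator(int... docs) { this.docs = docs; }

    // Contract assumed here: -1 before the first advance,
    // NO_MORE_DOCS once the iterator is exhausted.
    int doc() {
        if (pos < 0) return -1;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    // Moves forward at least one position, to the first doc >= target.
    int advance(int target) {
        do { pos++; } while (pos < docs.length && docs[pos] < target);
        if (pos > docs.length) pos = docs.length; // stay parked at the end
        return doc();
    }
}

class Conjunction {
    private final IntIterator a;
    private final IntIterator b;

    Conjunction(IntIterator a, IntIterator b) { this.a = a; this.b = b; }

    // No firstTime flag: both iterators start at -1, so the same leapfrog
    // loop serves the first call and every later call.
    int nextDoc() {
        int target = a.advance(a.doc() + 1);
        while (target != IntIterator.NO_MORE_DOCS) {
            int other = b.doc() < target ? b.advance(target) : b.doc();
            if (other == target) return target;               // both agree
            if (other == IntIterator.NO_MORE_DOCS) return other;
            target = a.advance(other);                         // a catches up
        }
        return IntIterator.NO_MORE_DOCS;
    }
}
```

The -1 start value lets `a.doc() + 1` serve as the first target, which is where a firstTime branch would otherwise be needed.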
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711605#action_12711605 ] Michael McCandless commented on LUCENE-1614: bq. So Mike - does that mean I can change nextDoc() behavior in OBSI and document it? OK let's tentatively go forward with that? > Add next() and skipTo() variants to DocIdSetIterator that return the current > doc, instead of boolean > > > Key: LUCENE-1614 > URL: https://issues.apache.org/jira/browse/LUCENE-1614 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera > Fix For: 2.9 > > Attachments: LUCENE-1614.patch > > > See > http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html > for the full discussion. The basic idea is to add variants to those two > methods that return the current doc they are at, to save successive calls to > doc(). If there are no more docs, return -1. A summary of what was discussed > so far: > # Deprecate those two methods. > # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI > (calls next() and skipTo() respectively, and will be changed to abstract in > 3.0). > #* I actually would like to propose an alternative to the names: advance() > and advance(int) - the first advances by one, the second advances to target. > # Wherever these are used, do something like '(doc = advance()) >= 0' instead > of comparing to -1 for improved performance. > I will post a patch shortly -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1648) when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes
[ https://issues.apache.org/jira/browse/LUCENE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711604#action_12711604 ] Michael McCandless commented on LUCENE-1648: OK -- good catch! I've reopened this (to remember this bug is present); let's fix this issue under "always use MSR". > when you clone or reopen an IndexReader with pending changes, the new reader > doesn't commit the changes > --- > > Key: LUCENE-1648 > URL: https://issues.apache.org/jira/browse/LUCENE-1648 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1648-followup.patch, LUCENE-1648-followup.patch, > LUCENE-1648.patch > > > While working on LUCENE-1647, I came across this issue... we are failing to > carry over hasChanges, norms/deletionsDirty, etc, when cloning the new reader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Reopened: (LUCENE-1648) when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes
[ https://issues.apache.org/jira/browse/LUCENE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reopened LUCENE-1648: > when you clone or reopen an IndexReader with pending changes, the new reader > doesn't commit the changes > --- > > Key: LUCENE-1648 > URL: https://issues.apache.org/jira/browse/LUCENE-1648 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1648-followup.patch, LUCENE-1648-followup.patch, > LUCENE-1648.patch > > > While working on LUCENE-1647, I came across this issue... we are failing to > carry over hasChanges, norms/deletionsDirty, etc, when cloning the new reader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711597#action_12711597 ] Shai Erera commented on LUCENE-1614: bq. I don't see any calls to OpenBitSetIterator.nextDoc in solr's source code. So Mike - does that mean I can change nextDoc() behavior in OBSI and document it? > Add next() and skipTo() variants to DocIdSetIterator that return the current > doc, instead of boolean > > > Key: LUCENE-1614 > URL: https://issues.apache.org/jira/browse/LUCENE-1614 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera > Fix For: 2.9 > > Attachments: LUCENE-1614.patch > > > See > http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html > for the full discussion. The basic idea is to add variants to those two > methods that return the current doc they are at, to save successive calls to > doc(). If there are no more docs, return -1. A summary of what was discussed > so far: > # Deprecate those two methods. > # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI > (calls next() and skipTo() respectively, and will be changed to abstract in > 3.0). > #* I actually would like to propose an alternative to the names: advance() > and advance(int) - the first advances by one, the second advances to target. > # Wherever these are used, do something like '(doc = advance()) >= 0' instead > of comparing to -1 for improved performance. > I will post a patch shortly -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1648) when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes
[ https://issues.apache.org/jira/browse/LUCENE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-1648: Attachment: LUCENE-1648-followup.patch And here's the fix. The problem - it's not elegant at all. We can scrap it and I can do the fix within "always use MSR" patch, it looks way cleaner there. > when you clone or reopen an IndexReader with pending changes, the new reader > doesn't commit the changes > --- > > Key: LUCENE-1648 > URL: https://issues.apache.org/jira/browse/LUCENE-1648 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1648-followup.patch, LUCENE-1648-followup.patch, > LUCENE-1648.patch > > > While working on LUCENE-1647, I came across this issue... we are failing to > carry over hasChanges, norms/deletionsDirty, etc, when cloning the new reader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene's default settings & back compatibility
On May 21, 2009, at 7:17 AM, Michael McCandless wrote:

> 1) Default settings can change; we will always choose defaults based
> on "latest & greatest for new users". This only affects "runtime
> behavior". EG in 2.9, when sorting by field you won't get scores by
> default. When we do this we should clearly document the change, and
> what settings one could use to get back to the old behavior, in
> CHANGES.txt.

I'd reverse 1 and 2 and note in 1 that the old behavior might be deprecated.

> 2) An API, once released as deprecated, is fair game to be removed in
> the next minor release.

I presume you mean that it will be present for at least one full minor release. So, if at 3.1.5 a deprecation is introduced, then it won't be removed until 3.3 at the earliest, because 3.2 was the first minor release in which it appeared at the start. I don't think it is fair to expect users to get every last point release. If so +1 from a user.

> We still only make bug fixes on point releases, support the index file
> format until the next major release -- those don't change.

Is it just the index file format? I would hope that the behavior of filters, analyzers and such would not change so as to invalidate an index.

-- DM
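The "one full minor release" reading in the reply above boils down to a small piece of arithmetic, sketched here. `DeprecationWindow` and `earliestRemoval` are hypothetical names, the rule encoded is only DM's interpretation as stated in the message, and rollover to a new major version is deliberately ignored.

```java
// Hypothetical helper encoding the "one full minor release" interpretation.
class DeprecationWindow {
    // Returns {major, minor} of the earliest release that may drop an API
    // first deprecated in release major.minor.point.
    static int[] earliestRemoval(int major, int minor, int point) {
        // Deprecated in X.Y.0: the API is deprecated for all of X.Y,
        // so X.Y itself is the first full minor release.
        // Deprecated in a later point release X.Y.Z: the first minor release
        // that carries the deprecation from its start is X.(Y+1).
        int firstFullMinor = (point == 0) ? minor : minor + 1;
        // Removal is allowed in the minor release after that one.
        return new int[] { major, firstFullMinor + 1 };
    }
}
```

For the example in the message, a deprecation introduced at 3.1.5 gives 3.2 as the first full minor release and 3.3 as the earliest removal.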
[jira] Updated: (LUCENE-1648) when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes
[ https://issues.apache.org/jira/browse/LUCENE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-1648: Attachment: LUCENE-1648-followup.patch bq. Bad news is something is wrong w/ your patch, because MSR needs to carry over hasChanges/deletions/norms on clone(). Something is wrong with trunk. Try the test in the patch. It is absolutely identical to yours, except it adds another doc to index to have two segments and consequently MSR. > when you clone or reopen an IndexReader with pending changes, the new reader > doesn't commit the changes > --- > > Key: LUCENE-1648 > URL: https://issues.apache.org/jira/browse/LUCENE-1648 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1648-followup.patch, LUCENE-1648.patch > > > While working on LUCENE-1647, I came across this issue... we are failing to > carry over hasChanges, norms/deletionsDirty, etc, when cloning the new reader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711588#action_12711588 ] Shalin Shekhar Mangar commented on LUCENE-1614: --- bq. Perhaps the Solr guys can state then if and how much they mind this change? Before we start the journey of finding a different name for DISI.nextDoc(), just to deprecated OBSI.nextDoc() I don't see any calls to OpenBitSetIterator.nextDoc in solr's source code. > Add next() and skipTo() variants to DocIdSetIterator that return the current > doc, instead of boolean > > > Key: LUCENE-1614 > URL: https://issues.apache.org/jira/browse/LUCENE-1614 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shai Erera > Fix For: 2.9 > > Attachments: LUCENE-1614.patch > > > See > http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html > for the full discussion. The basic idea is to add variants to those two > methods that return the current doc they are at, to save successive calls to > doc(). If there are no more docs, return -1. A summary of what was discussed > so far: > # Deprecate those two methods. > # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI > (calls next() and skipTo() respectively, and will be changed to abstract in > 3.0). > #* I actually would like to propose an alternative to the names: advance() > and advance(int) - the first advances by one, the second advances to target. > # Wherever these are used, do something like '(doc = advance()) >= 0' instead > of comparing to -1 for improved performance. > I will post a patch shortly -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker
[ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711586#action_12711586 ]

Shai Erera commented on LUCENE-1595:

BTW, am I allowed to use Java 5 generics in benchmark? Or should benchmark stay on 1.4 as well until 3.0? I'm asking because I have heard a couple of times that contrib is allowed to move to Java 5.

> Split DocMaker into ContentSource and DocMaker
> --
>
> Key: LUCENE-1595
> URL: https://issues.apache.org/jira/browse/LUCENE-1595
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/benchmark
> Reporter: Shai Erera
> Assignee: Mark Miller
> Fix For: 2.9
>
> This issue proposes some refactoring to the benchmark package. Today, DocMaker has two roles: collecting documents from a collection and preparing a Document object. These two should actually be split up into ContentSource and DocMaker, which will use a ContentSource instance.
> ContentSource will implement all the content-reading methods of DocMaker, like getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ 1591, by having a basic ContentSource that offers input stream services, and wraps a file (for example) with a bzip or gzip stream etc.
> DocMaker will implement the makeDocument methods, reusing DocState etc.
> The idea is that collecting the Enwiki documents, for example, should be the same whether I create documents using DocState, add payloads or index additional metadata. Same goes for Trec and Reuters collections, as well as LineDocMaker.
> In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 99% the same and 99% different. Most of their differences lie in the way they read the data, while most of the similarity lies in the way they create documents (using DocState).
> That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker (just for the reuse of DocState). Also, other DocMakers do not use that DocState today, something they could have gotten for free with the refactoring proposed here.
> So by having an EnwikiContentSource, ReutersContentSource and others (TREC, Line, Simple), I can write several DocMakers, such as DocStateMaker, ConfigurableDocMaker (one which accepts all kinds of config options) and custom DocMakers (payload, facets, sorting), passing to them a ContentSource instance and reusing the same DocMaking algorithm with many content sources, as well as the same ContentSource algorithm with many DocMaker implementations.
> This will also give us the opportunity to perf test content sources alone (i.e., compare bzip, gzip and regular input streams), w/o the overhead of creating a Document object.
> I've already done so in my code environment (I extend the benchmark package for my application's purposes) and I like the flexibility I have. I think this can be a nice contribution to the benchmark package, which can result in some code cleanup as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
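To make the proposed split concrete, here is a minimal sketch of how a content source and a document maker might be separated. Only the names ContentSource, DocMaker, getNextDocData and makeDocument come from the issue text; DocData, ListContentSource, the field names and all signatures are assumptions for illustration and are not the benchmark package's actual API. The sketch also uses Java 5 generics, which is exactly the open question in the comment above.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Raw content of one document, before any indexing decisions (hypothetical type). */
final class DocData {
    final String title;
    final String body;

    DocData(String title, String body) {
        this.title = title;
        this.body = body;
    }
}

/** Reads raw documents from a collection; the shape of this interface is assumed. */
interface ContentSource {
    /** Returns the next raw document, or null when the source is exhausted. */
    DocData getNextDocData();
}

/** Trivial in-memory source, standing in for Enwiki/Reuters/TREC readers. */
final class ListContentSource implements ContentSource {
    private final Iterator<DocData> it;

    ListContentSource(List<DocData> docs) {
        this.it = docs.iterator();
    }

    public DocData getNextDocData() {
        return it.hasNext() ? it.next() : null;
    }
}

/** Builds documents from whatever source it is given; a "document" here is a field map. */
class DocMaker {
    private final ContentSource source;

    DocMaker(ContentSource source) {
        this.source = source;
    }

    /** Returns the next document, or null when the underlying source is exhausted. */
    Map<String, String> makeDocument() {
        DocData data = source.getNextDocData();
        if (data == null) {
            return null;
        }
        Map<String, String> doc = new LinkedHashMap<String, String>();
        doc.put("title", data.title);
        doc.put("body", data.body);
        return doc;
    }
}
```

With this shape, the same DocMaker can be pointed at any ContentSource, and a ContentSource can be timed on its own by calling getNextDocData() in a loop without building documents.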
[jira] Updated: (LUCENE-1647) IndexReader.undeleteAll can mess up the deletion count stored in the segments file
[ https://issues.apache.org/jira/browse/LUCENE-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1647:
---
Attachment: LUCENE-1647.patch

Attached a patch w/ a test showing the issue, and a fix that makes the test pass. I plan to commit in a day or two.

> IndexReader.undeleteAll can mess up the deletion count stored in the segments file
> --
>
> Key: LUCENE-1647
> URL: https://issues.apache.org/jira/browse/LUCENE-1647
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Reporter: Michael McCandless
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1647.patch, LUCENE-1647.patch
>
> Spinoff from LUCENE-1474. I'll attach a test case showing the issue.
[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker
[ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711583#action_12711583 ]

Shai Erera commented on LUCENE-1595:

bq. Maybe make the seed an optional config? If it's not present, let it pick a random seed?

I already went ahead and did that. The only difference is that if it's not present, I chose 13 instead of drawing a seed every time. That way, runs can be consistent and compared to each other.
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711580#action_12711580 ]

Shai Erera commented on LUCENE-1614:

bq. My guess is eg Solr probably relies heavily on OBSI.nextDoc returning -1

Perhaps the Solr guys can state then if and how much they mind this change? Before we start the journey of finding a different name for DISI.nextDoc(), just to deprecate OBSI.nextDoc() ...

> Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
> --
>
> Key: LUCENE-1614
> URL: https://issues.apache.org/jira/browse/LUCENE-1614
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1614.patch
>
> See http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html for the full discussion. The basic idea is to add variants to those two methods that return the current doc they are at, to save successive calls to doc(). If there are no more docs, return -1. A summary of what was discussed so far:
> # Deprecate those two methods.
> # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI (calls next() and skipTo() respectively, and will be changed to abstract in 3.0).
> #* I actually would like to propose an alternative to the names: advance() and advance(int) - the first advances by one, the second advances to target.
> # Wherever these are used, do something like '(doc = advance()) >= 0' instead of comparing to -1 for improved performance.
> I will post a patch shortly.
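The summary in the issue description can be sketched roughly as follows. This is a simplified stand-in, not the real DocIdSetIterator and not the attached patch: SimpleDocIdIterator, ArrayDocIdIterator and IterationDemo are hypothetical names, and only the ideas (a default nextDoc() layered on next()/doc(), -1 when exhausted, and the `(doc = ...) >= 0` loop) are taken from the description.

```java
/** Simplified stand-in for an iterator over document ids; not the Lucene class. */
abstract class SimpleDocIdIterator {
    /** Old-style API: advances and reports whether a document is available. */
    abstract boolean next();

    /** Old-style API: the current document, valid only after next() returned true. */
    abstract int doc();

    /**
     * New-style API with a default implementation on top of the old methods,
     * so existing subclasses keep working. Returns -1 when there are no more docs.
     */
    int nextDoc() {
        return next() ? doc() : -1;
    }
}

/** Iterates over a fixed array of document ids. */
final class ArrayDocIdIterator extends SimpleDocIdIterator {
    private final int[] docs;
    private int pos = -1;

    ArrayDocIdIterator(int[] docs) {
        this.docs = docs;
    }

    boolean next() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length;
    }

    int doc() {
        return docs[pos];
    }
}

final class IterationDemo {
    /** Consumes the iterator with a single call per step instead of next() plus doc(). */
    static int sum(SimpleDocIdIterator it) {
        int total = 0;
        int doc;
        while ((doc = it.nextDoc()) >= 0) {
            total += doc;
        }
        return total;
    }
}
```

The loop in IterationDemo.sum is the calling pattern the description suggests: one method call per document, with the sign of the result doubling as the end-of-iteration check.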
[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker
[ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711581#action_12711581 ]

Michael McCandless commented on LUCENE-1595:

bq. While changing SortableSingleDocMaker I noticed it creates a new Random() in getNextDocData(). Shouldn't that Random be created once? Also, I think it should be created with a seed?

OK, sharing a Random instance seems good. Maybe make the seed an optional config? If it's not present, let it pick a random seed?
[jira] Resolved: (LUCENE-1648) when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes
[ https://issues.apache.org/jira/browse/LUCENE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1648.

Resolution: Fixed

> when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes
> ---
>
> Key: LUCENE-1648
> URL: https://issues.apache.org/jira/browse/LUCENE-1648
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1648.patch
>
> While working on LUCENE-1647, I came across this issue... we are failing to carry over hasChanges, norms/deletionsDirty, etc, when cloning the new reader.
[jira] Commented: (LUCENE-1648) when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes
[ https://issues.apache.org/jira/browse/LUCENE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711578#action_12711578 ]

Michael McCandless commented on LUCENE-1648:

{quote}
Or to be more exact, it fixed the tests that were failing for me. Your two new tests fail as soon as I apply "always use MSR", because MSR does not carry over hasChanges/hasDeletions on clone().
{quote}

Hmm, OK so it's a good news / bad news situation. Good news is patch resolved your original issue. I'll commit it shortly. Bad news is something is wrong w/ your patch, because MSR needs to carry over hasChanges/deletions/norms on clone().
Re: Lucene's default settings & back compatibility
On Thu, May 21, 2009 at 7:21 AM, Shai Erera wrote:
> I thought that the index file format is supposed to be supported until the 2nd major release. I.e. 3.0 will still read 2.0 indexes, but 4.0 won't. Is that what you meant, or am I wrong?

Woops, you're correct:

http://wiki.apache.org/jakarta-lucene/BackwardsCompatibility

Mike
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711576#action_12711576 ]

Michael McCandless commented on LUCENE-1614:

bq. I think I'll emphasize that in the javadocs, documenting the limitation of MAX_VAL so that people won't assume the wrong things.

+1

A docID is different from an "int", because docIDs must be 0 .. MAX_VAL-1.

{quote}
Why? just because OBSI declared a method which we wanted anyway? You know .. it's something we don't give much thought to when we add methods to abstract classes, but what if someone extended DISI and added his own advance(int) or nextDoc() which don't behave like we expect them to. When he'll pass his DISI to the search flow somehow, not knowing these have become the primary methods, something will break.

I'm not saying we should protect these cases too, because otherwise we won't be able to make any changes. But just because OBSI had nextDoc() declared doesn't mean we should go and find a different name.

That's slightly unrelated to this issue, but our back-compat policy forces us to replace good names with moderate ones, just because we cannot change methods.
{quote}

You're right, a random subclass of an abstract class could very well choose the name we are wanting to add, and then their class fails to compile, or (if the sigs turn out to be identical) runs but possibly causes problems.

But in this case we know we have just such a class that has done so (OBSI). And of course it did so for exactly the reasons that we are now wanting to add nextDoc to DISI. My guess is eg Solr probably relies heavily on OBSI.nextDoc returning -1 when it's done and we're gonna cause AIOOB exceptions if we up and change to returning MAX_VAL.

bq. That's slightly unrelated to this issue, but our back-compat policy forces us to replace good names with moderate ones, just because we cannot change methods.

As much as it bothers me having to accept inferior names (so they don't conflict with the existing names), I think it's very much the lesser-of-evils here.
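The concern about changing the end-of-iteration value can be illustrated with a small sketch. Everything here is hypothetical (DocIterator, SentinelDemo, countHits); it is not Solr or Lucene code. It only shows why a caller written against "-1 means done" can hit an ArrayIndexOutOfBoundsException if the iterator starts returning a large sentinel such as Integer.MAX_VALUE instead.

```java
/** Minimal iterator contract used only for this illustration. */
interface DocIterator {
    int nextDoc();
}

final class SentinelDemo {
    /** Builds an iterator over docs that returns endSentinel once exhausted. */
    static DocIterator over(final int[] docs, final int endSentinel) {
        return new DocIterator() {
            private int pos = 0;

            public int nextDoc() {
                return pos < docs.length ? docs[pos++] : endSentinel;
            }
        };
    }

    /** A caller that assumes the iterator signals exhaustion with -1. */
    static int[] countHits(DocIterator it, int maxDoc) {
        int[] hits = new int[maxDoc];
        int doc;
        while ((doc = it.nextDoc()) != -1) {
            // If the sentinel is not -1, it is treated as a doc id and indexes past the array.
            hits[doc]++;
        }
        return hits;
    }
}
```

With the -1 sentinel the loop terminates normally; with Integer.MAX_VALUE the same caller fails on the first exhausted call, which is the kind of breakage the comment anticipates for existing users of the -1 contract.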
[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker
[ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711577#action_12711577 ]

Shai Erera commented on LUCENE-1595:

While changing SortableSingleDocMaker I noticed it creates a new Random() in getNextDocData(). Shouldn't that Random be created once? Also, I think it should be created with a seed?
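A minimal sketch of the idea discussed in this exchange: create the Random once, seed it from an optional config value, and fall back to a fixed seed so that runs are repeatable. The class name, the property name "random.seed" and the use of java.util.Properties are assumptions for illustration; the benchmark package's real configuration mechanism is not shown in this thread. The fallback value 13 is the one mentioned elsewhere in the thread.

```java
import java.util.Properties;
import java.util.Random;

final class SeededRandomFactory {
    /** Fixed fallback seed so that unconfigured runs are comparable to each other. */
    static final long DEFAULT_SEED = 13L;

    /**
     * Creates a Random from an optional seed setting.
     * "random.seed" is a hypothetical property name used only in this sketch.
     */
    static Random fromConfig(Properties config) {
        String value = config.getProperty("random.seed");
        long seed = (value == null) ? DEFAULT_SEED : Long.parseLong(value.trim());
        return new Random(seed);
    }
}
```

A doc maker would call fromConfig once (for example in its setup step) and keep the returned instance in a field, instead of calling new Random() for every document.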
[jira] Commented: (LUCENE-1648) when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes
[ https://issues.apache.org/jira/browse/LUCENE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711575#action_12711575 ]

Earwin Burrfoot commented on LUCENE-1648:
-
Or to be more exact, it fixed the tests that were failing for me. Your two new tests fail as soon as I apply "always use MSR", because MSR does not carry over hasChanges/hasDeletions on clone().
[jira] Commented: (LUCENE-1648) when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes
[ https://issues.apache.org/jira/browse/LUCENE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711571#action_12711571 ]

Earwin Burrfoot commented on LUCENE-1648:
-
bq. Try the patch?

Yup, it fixed everything.
[jira] Commented: (LUCENE-1646) QueryParser throws new exceptions even if custom parsing logic threw a better one
[ https://issues.apache.org/jira/browse/LUCENE-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711570#action_12711570 ]

Michael McCandless commented on LUCENE-1646:

bq. I guess that's true if you look at exceptions as a logging mechanism, but in our case it's a parsing exception for text coming from the user. Because of this, our use case is for the user to get a useful error message, and it's not useful at all if we just tell them their entire query was bad. Thus we have inserted improvements (in our subclass) to make it complain only about the fragment of the query which is actually a problem, so they know which part to fix.

I agree, in general an app ought to be catching this ParseException and messaging something much more succinct/digestible to the end user.

When QueryParser throws ParseException, it does attempt to explain what went wrong in many cases. It sounds like you've modified it further to do a better job? Is this something you could donate back?

I agree we should make it easier to customize the ParseException, but I think by default having the exception message include the full query is useful for those apps that simply log the exception. This way we can have our cake and eat it too...

bq. Related, but is there any way it could at least be reduced to the portion of the query which caused the problem? In a way it would be nice if ParseException had methods to get out the problematic fragment (my subclass has it...) I'm guessing this is much easier for exceptions relating to values inside fields which otherwise parsed correctly, but a lot harder to do for exceptions from the parser proper.

It seems like it already makes some effort here? Eg it has attrs like currentToken, expectedTokenSequences, etc., recorded when the exception is thrown. And getMessage() pulls from these attrs to build up the message? Can you post a patch with your improvements here?

> QueryParser throws new exceptions even if custom parsing logic threw a better one
> -
>
> Key: LUCENE-1646
> URL: https://issues.apache.org/jira/browse/LUCENE-1646
> Project: Lucene - Java
> Issue Type: Improvement
> Affects Versions: 2.4.1
> Reporter: Trejkaz
>
> We have subclassed QueryParser and have various custom fields. When these fields contain invalid values, we throw a subclass of ParseException which has a more useful message (and also a localised message.)
> Problem is, Lucene's QueryParser is doing this:
> {code}
> catch (ParseException tme) {
>   // rethrow to include the original query:
>   throw new ParseException("Cannot parse '" + query + "': " + tme.getMessage());
> }
> {code}
> Thus, our nice and useful ParseException is thrown away, replaced by one with no information about what's actually wrong with the query (it does append getMessage() but that isn't localised. And it also throws away the underlying cause for the exception.)
> I am about to patch our copy to simply remove these four lines; the caller knows what the query string was (they have to have a copy of it because they are passing it in!) so having it in the error message itself is not useful.
> Furthermore, when the query string is very big, what the user wants to know is not that the whole query was bad, but which part of it was bad.
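One possible middle ground between the two positions, sketched under assumptions: keep the full query in the wrapper's message for apps that only log it, but attach the original exception as the cause so that a custom subclass (and its problematic fragment) stays reachable. The classes below are stand-ins written for this illustration; QueryParseException is not Lucene's ParseException, and this is not the change that was actually made to QueryParser.

```java
/** Stand-in for the parser's checked exception type (hypothetical). */
class QueryParseException extends Exception {
    QueryParseException(String message) {
        super(message);
    }
}

/** A custom subclass that records which fragment of the query was rejected. */
class FieldValueException extends QueryParseException {
    final String fragment;

    FieldValueException(String message, String fragment) {
        super(message);
        this.fragment = fragment;
    }
}

final class RethrowDemo {
    /** Hypothetical clause parser that rejects any clause ending in ":bad". */
    static void parseClause(String clause) throws QueryParseException {
        if (clause.endsWith(":bad")) {
            throw new FieldValueException("invalid value", clause);
        }
    }

    /** Wraps failures with the full query text while preserving the original as the cause. */
    static void parse(String query) throws QueryParseException {
        try {
            for (String clause : query.split(" ")) {
                parseClause(clause);
            }
        } catch (QueryParseException original) {
            QueryParseException wrapped = new QueryParseException(
                    "Cannot parse '" + query + "': " + original.getMessage());
            wrapped.initCause(original); // the custom exception remains available via getCause()
            throw wrapped;
        }
    }
}
```

A caller that wants the precise fragment can check whether getCause() is a FieldValueException, while a caller that only logs getMessage() still sees the full query.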
Re: Lucene's default settings & back compatibility
I thought that the index file format is supposed to be supported until the 2nd major release. I.e. 3.0 will still read 2.0 indexes, but 4.0 won't. Is that what you meant, or am I wrong?

Shai

On Thu, May 21, 2009 at 2:17 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

> OK so it sounds like we've boiled the proposal down to two concrete changes to the back-compat policy:
>
> 1) Default settings can change; we will always choose defaults based on "latest & greatest for new users". This only affects "runtime behavior". EG in 2.9, when sorting by field you won't get scores by default. When we do this we should clearly document the change, and what settings one could use to get back to the old behavior, in CHANGES.txt.
>
> 2) An API, once released as deprecated, is fair game to be removed in the next minor release.
>
> We still only make bug fixes on point releases, support the index file format until the next major release -- those don't change.
>
> Mike
>
> On Wed, May 20, 2009 at 11:34 PM, Shai Erera wrote:
> >> With the new way, you can get the first bug fix release, but then you will quickly be left out of new bug fixes until you update your code.
> >
> > Mark, apologies for the late reference, but it struck me only after I left the computer yesterday. Again, I'm not sure how big of a problem it is. Suppose that I upgrade to 2.4 and the next version (6 months?) is 2.9. Then a bug is discovered and is fixed in 2.4.1 and 2.9. In addition, 2.9 contains all these changes in Collectors. When 2.9 is out I decide not to upgrade to 2.9 because I don't have time. When 3.0 comes out (3-4 months later?) I will be forced to upgrade. That means ~1 year since I last upgraded my Lucene code sections.
> > (True, if there will be any bug fixes in 2.9, I will miss them).
> >
> > How unreasonable is it to ask this? Seriously, how many apps are not touched *at all* during one year? And even if these code segments are stable and no one touches them anymore, how problematic is it to request users to once a year do a sort of cleanup and update to their code?
> >
> >> In other words, we keep deprecated around for only one or two versions.
> >
> > That is a reasonable approach. Meaning, defaults may change between releases because we'd like Lucene users to get the latest & greatest stuff, (and also count on the fact not so many out there strongly rely on the defaults?) but methods removal/rename should cause a little more trouble, so we can give the users one release to arrange their time before they have to do anything.
> >
> > Maybe the TokenStream API needs to stay deprecated for longer, until we're sure everybody is happy with the new API.
> >
> > Shai
> >
> > On Thu, May 21, 2009 at 4:23 AM, Grant Ingersoll wrote:
> >>
> >> On May 20, 2009, at 4:06 PM, Michael McCandless wrote:
> >>
> >>> On Wed, May 20, 2009 at 3:24 PM, Shai Erera wrote:
> >>>> Then why go through all this trouble and not simply change the back-compat policy?
> >>>
> >>> Back-compat is insanely costly, especially the longer it takes us to get to the next major release... yet, the specific cost that bothers me the most is that we hurt our new users because of the back-compat users. It hurts Lucene's adoption/growth.
> >>>
> >>> Another consideration on relaxing policy is that back-compat is well nigh impossible to actually achieve. We spend an insane amount of our energy maintaining back-compat, but then one accidental breakage that slips through quickly causes many back-compat users to conclude we are not back-compat. It's not much bang and a lot of buck.
> >>>
> >>> It is tempting to change our policy to something like:
> >>>
> >>> * Bug fixes only on each 2.4.X release
> >>>
> >>> * Anything can change on each 2.X release, but any prior 2.Y index format is readable
> >>>
> >>> I think it's not unreasonable to say "if you want to take advantage of Lucene's perf improvements and new features, on upgrading you'll have to recompile, fix APIs, etc.".
> >>
> >> All reasonable, Mike. My take is that Lucene has always been pragmatic about darn near everything, except back compat, where we are pretty dogmatic.
> >>
> >> In general, I think it is reasonable to say that even from 2.x to 2.y we will try to be back compatible, but when we deem it necessary, we reserve the right to change things. I don't think anyone here is suggesting we would ever do something drastic like a complete overhaul of all the APIs in a version change. I also think it is reasonable to deprecate things by saying @deprecated Will be removed in 2.Y. Use coolNewMethod instead. In other words, we keep deprecated around for only one or two versions. Of course, the timing can vary. Things like changi
Re: Lucene's default settings & back compatibility
OK so it sounds like we've boiled the proposal down to two concrete changes to the back-compat policy:

1) Default settings can change; we will always choose defaults based on "latest & greatest for new users". This only affects "runtime behavior". E.g., in 2.9, when sorting by field you won't get scores by default. When we do this we should clearly document the change, and what settings one could use to get back to the old behavior, in CHANGES.txt.

2) An API, once released as deprecated, is fair game to be removed in the next minor release.

We still only make bug fixes on point releases, support the index file format until the next major release -- those don't change.

Mike

On Wed, May 20, 2009 at 11:34 PM, Shai Erera wrote:
>> With the new way, you can get the first bug fix release, but then you will quickly be left out of new bug fixes until you update your code.
>
> Mark, apologies for the late reference, but it struck me only after I left the computer yesterday. Again, I'm not sure how big of a problem it is. Suppose that I upgrade to 2.4 and the next version (6 months?) is 2.9. Then a bug is discovered and is fixed in 2.4.1 and 2.9. In addition, 2.9 contains all these changes in Collectors. When 2.9 is out I decide not to upgrade to 2.9 because I don't have time. When 3.0 comes out (3-4 months later?) I will be forced to upgrade. That means ~1 year since I last upgraded my Lucene code sections.
> (True, if there will be any bug fixes in 2.9, I will miss them).
>
> How unreasonable is it to ask this? Seriously, how many apps are not touched *at all* during one year? And even if these code segments are stable and no one touches them anymore, how problematic is it to request users to once a year do a sort of cleanup and update to their code?
>
>> In other words, we keep deprecated around for only one or two versions.
>
> That is a reasonable approach. Meaning, defaults may change between releases because we'd like Lucene users to get the latest & greatest stuff, (and also count on the fact not so many out there strongly rely on the defaults?) but methods removal/rename should cause a little more trouble, so we can give the users one release to arrange their time before they have to do anything.
>
> Maybe the TokenStream API needs to stay deprecated for longer, until we're sure everybody is happy with the new API.
>
> Shai
>
> On Thu, May 21, 2009 at 4:23 AM, Grant Ingersoll wrote:
>>
>> On May 20, 2009, at 4:06 PM, Michael McCandless wrote:
>>
>>> On Wed, May 20, 2009 at 3:24 PM, Shai Erera wrote:
>>>> Then why go through all this trouble and not simply change the back-compat policy?
>>>
>>> Back-compat is insanely costly, especially the longer it takes us to get to the next major release... yet, the specific cost that bothers me the most is that we hurt our new users because of the back-compat users. It hurts Lucene's adoption/growth.
>>>
>>> Another consideration on relaxing policy is that back-compat is well nigh impossible to actually achieve. We spend an insane amount of our energy maintaining back-compat, but then one accidental breakage that slips through quickly causes many back-compat users to conclude we are not back-compat. It's not much bang and a lot of buck.
>>>
>>> It is tempting to change our policy to something like:
>>>
>>> * Bug fixes only on each 2.4.X release
>>>
>>> * Anything can change on each 2.X release, but any prior 2.Y index format is readable
>>>
>>> I think it's not unreasonable to say "if you want to take advantage of Lucene's perf improvements and new features, on upgrading you'll have to recompile, fix APIs, etc.".
>>
>> All reasonable, Mike. My take is that Lucene has always been pragmatic about darn near everything, except back compat, where we are pretty dogmatic.
>>
>> In general, I think it is reasonable to say that even from 2.x to 2.y we will try to be back compatible, but when we deem it necessary, we reserve the right to change things. I don't think anyone here is suggesting we would ever do something drastic like a complete overhaul of all the APIs in a version change. I also think it is reasonable to deprecate things by saying @deprecated Will be removed in 2.Y. Use coolNewMethod instead. In other words, we keep deprecated around for only one or two versions. Of course, the timing can vary. Things like changing the Document stuff like we've talked about might last longer (or shorter, actually) while minor deprecations may only be kept for one. The index compatibility stuff is a must.
>>
>> It is probably worthwhile to ask on java-user@ how many people rely on our back compat policies.
>>
>> Of course, we do already support back compat for all versions: svn checkout http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_3_1/ (tongue in cheek)
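For readers following along, the deprecation note Grant suggests ("@deprecated Will be removed in 2.Y. Use coolNewMethod instead.") might look roughly like the Java sketch below. This is only an illustration: `ExampleApi` and `oldMethod` are hypothetical names, `coolNewMethod` is the placeholder from the message, "2.Y" is left as written, and none of this is actual Lucene code.

```java
/** Hypothetical class, not part of Lucene; it only illustrates the javadoc convention. */
class ExampleApi {

  /**
   * Old entry point, kept around for a limited number of releases.
   *
   * @deprecated Will be removed in 2.Y. Use {@link #coolNewMethod(String)} instead.
   */
  @Deprecated
  String oldMethod(String text) {
    // Delegate to the replacement so both paths behave the same until removal.
    return coolNewMethod(text);
  }

  /** Replacement for {@link #oldMethod(String)}. */
  String coolNewMethod(String text) {
    return text.trim();
  }
}

class DeprecationDemo {
  public static void main(String[] args) {
    ExampleApi api = new ExampleApi();
    // Callers that migrate early only need to change the method name.
    System.out.println(api.coolNewMethod("  Lucene "));
  }
}
```

The `@Deprecated` annotation requires Java 5 or later; on older JDKs the javadoc tag alone carries the notice. Having the old method delegate to the new one keeps the two from drifting apart during the deprecation window.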
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711555#action_12711555 ]

Shai Erera commented on LUCENE-1614:

BTW, regarding SortedVIntList - even though it extends DocIdSet, its javadocs start with "Store and iterate sorted integers in compressed form in RAM." - doc Ids are not mentioned. Also, the class is public, so nothing prevents someone from using it for integers that are not Doc Ids. I think I'll emphasize that in the javadocs, documenting the limitation of MAX_VAL so that people won't assume the wrong things.

> Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
>
> Key: LUCENE-1614
> URL: https://issues.apache.org/jira/browse/LUCENE-1614
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Shai Erera
> Fix For: 2.9
> Attachments: LUCENE-1614.patch
>
> See http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html for the full discussion. The basic idea is to add variants to those two methods that return the current doc they are at, to save successive calls to doc(). If there are no more docs, return -1. A summary of what was discussed so far:
> # Deprecate those two methods.
> # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI (calls next() and skipTo() respectively, and will be changed to abstract in 3.0).
> #* I actually would like to propose an alternative to the names: advance() and advance(int) - the first advances by one, the second advances to target.
> # Wherever these are used, do something like '(doc = advance()) >= 0' instead of comparing to -1 for improved performance.
> I will post a patch shortly

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
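The plan in the issue description above (deprecate next() and skipTo(), add doc-returning variants whose default implementation calls the old methods) can be sketched as follows. This is a minimal illustration, not the attached patch: `SketchDocIdSetIterator` and `ArrayIterator` are hypothetical stand-ins, the names nextDoc() and advance(int) are just one of the naming options discussed, and the -1 sentinel follows the issue description (the comments also discuss a MAX_VAL sentinel instead).

```java
/** Hypothetical stand-in for DocIdSetIterator; not the actual Lucene class. */
abstract class SketchDocIdSetIterator {

  /** @deprecated use {@link #nextDoc()} instead. */
  abstract boolean next();

  /** @deprecated use {@link #advance(int)} instead. */
  abstract boolean skipTo(int target);

  /** Returns the doc the iterator is currently positioned on. */
  abstract int doc();

  /** New variant: default impl calls next() and returns the doc, or -1 when exhausted. */
  int nextDoc() {
    return next() ? doc() : -1;
  }

  /** New variant: default impl calls skipTo(target) and returns the doc, or -1 when exhausted. */
  int advance(int target) {
    return skipTo(target) ? doc() : -1;
  }
}

/** An "old style" iterator over a sorted array; it implements only the old methods. */
class ArrayIterator extends SketchDocIdSetIterator {
  private final int[] docs;
  private int pos = -1;

  ArrayIterator(int[] docs) {
    this.docs = docs;
  }

  boolean next() {
    return ++pos < docs.length;
  }

  boolean skipTo(int target) {
    // Always moves forward at least once, then stops at the first doc >= target.
    while (next()) {
      if (docs[pos] >= target) {
        return true;
      }
    }
    return false;
  }

  int doc() {
    return docs[pos];
  }
}

class DisiDemo {
  public static void main(String[] args) {
    SketchDocIdSetIterator it = new ArrayIterator(new int[] {2, 5, 9});
    int doc;
    // One call per step, and a ">= 0" check rather than a comparison to -1.
    while ((doc = it.nextDoc()) >= 0) {
      System.out.println(doc);
    }
  }
}
```

Because the defaults delegate to the old methods, an existing subclass such as `ArrayIterator` keeps working unchanged, while callers can already move to the single-call loop.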
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711553#action_12711553 ]

Shai Erera commented on LUCENE-1614:

bq. SortedVIntList subclasses DocIdSet

Sorry, did not notice that. It's just that the test confused me, since I thought it just stores VInts with no direct relation to doc Ids.

bq. maybe we need to choose a different name than nextDoc()

Why? Just because OBSI declared a method which we wanted anyway? You know .. it's something we don't give much thought to when we add methods to abstract classes, but what if someone extended DISI and added his own advance(int) or nextDoc() which don't behave like we expect them to? When he passes his DISI to the search flow somehow, not knowing these have become the primary methods, something will break. I'm not saying we should protect these cases too, because otherwise we won't be able to make any changes. But just because OBSI had nextDoc() declared doesn't mean we should go and find a different name.

That's slightly unrelated to this issue, but our back-compat policy forces us to replace good names with moderate ones, just because we cannot change methods. Can't we just document in CHANGES that nextDoc() now returns MAX_VAL when no more docs exist, and fix the test in the tag? I mean, how many users do we think use OBSI directly?

bq. MAX_VAL is not a valid docID

I'll remove the test then (from trunk and tag) and document this assumption on DISI.
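The name-clash concern raised in this comment (a user subclass that already declares its own nextDoc() with different semantics) can be illustrated with a small Java sketch. All names here are hypothetical: `BaseIterator` stands in for DISI after the new method is added, and `UserIterator` stands in for third-party code written before that change.

```java
/** Hypothetical base class after a release that adds nextDoc() with a default impl. */
abstract class BaseIterator {
  abstract boolean next();

  abstract int doc();

  /** Newly added: advances and returns the doc, or -1 when exhausted. */
  int nextDoc() {
    return next() ? doc() : -1;
  }
}

/** Hypothetical user class, written before the base class had nextDoc(). */
class UserIterator extends BaseIterator {
  private final int maxDoc;
  private int current = -1;

  UserIterator(int maxDoc) {
    this.maxDoc = maxDoc;
  }

  boolean next() {
    return ++current < maxDoc;
  }

  int doc() {
    return current;
  }

  /**
   * The user's own helper: peeks at the doc that would come next WITHOUT advancing.
   * After the upgrade it silently overrides the new base method.
   */
  int nextDoc() {
    return current + 1;
  }
}

class NameClashDemo {
  public static void main(String[] args) {
    BaseIterator it = new UserIterator(3);
    // Framework code expects each call to advance; here both calls return 0,
    // so a loop like "while ((doc = it.nextDoc()) >= 0)" would never finish.
    System.out.println(it.nextDoc());
    System.out.println(it.nextDoc());
  }
}
```

The user's code still compiles against the new base class, and nothing flags the clash, which is why the comment argues this case cannot realistically be protected against and is better handled by documenting the change.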