Re: 5.0 release status?
To be clear, I myself am not trying to offer advice on whether or when people should upgrade – I’m trying solely to determine whether there is significant value in doing so, and what that value might be. I did indeed read through Robert’s list and have watched the Jira flow over the years, but I am unable to pinpoint “significant” improvements that will have more than just a “minor” impact for users. I’m not trying to say that significant improvements aren’t actually in there, just that I don’t know of any. If I am wrong, please provide the details.

Like... are there use cases where the 5.0 index will be at least 10% faster or at least 10% smaller, and if so, which specific features and use cases? Or is there a cumulative improvement in performance or capacity? Or... are there specific feature transitions to recommend that would result in dramatic improvements?

I mean, as things stand, there has been a lot of “shuffling around”, but no clear, quantified insight on the benefits of that shuffling/refactoring. I’m all for cleaner code (which can manifest as more reliable code and fewer bugs), but is that the gist of most of the index changes?

In short, I’m more interested in the impact of the 5.0 index changes (and their use cases), not the details of the implementation of those changes.

Put another way, will a typical app be at least 10% faster or 10% smaller (or both!) when its index is converted from 4.x to 5.0? Or 5% or 20% or... whatever it actually is?

And if there are specific new features that rely on conversion to the 5.0 index format, let’s get that list collected as some bullet points.

Call this preparation for the 5.0 release! Maybe it could be a summary section in the 5.0 migration guide.

Clearly there is plenty of goodness in the 5.0 work, but I’m just trying to get a handle on the overall impact.

-- Jack Krupansky

From: Ryan Ernst
Sent: Sunday, October 5, 2014 12:48 AM
To: dev@lucene.apache.org
Subject: Re: 5.0 release status?
On Oct 4, 2014 9:35 PM, Jack Krupansky j...@basetechnology.com wrote:

Maybe I just can’t fully make sense of LUCENE-5934 – does it corrupt all 4.x indexes, or some, or under some conditions? I mean, I had the impression that it was only non-GA 4.0 indexes. And was it only 4.10 that was doing this, or 4.0 GA through 4.9 as well?

The bug only affected people using the 4.10.0 release to read 4.0 beta/final segments (it thought they were 3x indexes).

In any case, I’m still not clear on the direct benefits to users of, say, 4.9 upgrading to 5.0 indexes. Any performance improvement? Any disk space reduction? Any RAM reduction?

Again, read through all the stuff Robert has mentioned, read through lucene/CHANGES.txt, read the issues that are currently open. Your previous comments have suggested users upgrading to 5.0 would only do so so they can eventually upgrade to 6.0, implying they wouldn't upgrade their indexes for minor releases. This simply is not the best advice. Look back at 4.9 and 4.10 for recent improvements in heap usage for doc values and norms for example. Going back farther, someone still on 4.0 doesn't benefit from the postings format improvements in 4.1. Users should upgrade their format whenever possible because improvements are always happening.
Re: 5.0 release status?
Why don't you just do as Ryan suggests and read the JIRA issues? I already outlined things like the memory improvements being made in the new format and for merging. I don't need to summarize it again.

And we don't have to justify fixing back compat corruptions with new features. It is the other way around.

If anyone doesn't like this approach to unfucking these problems for 5.0 and wants to continue with 4.x releases, then they need to step up to the plate and do the work. That's where the problem always is: lots of people who want to whine about back compat, but don't want to actually do any work.

On Sun, Oct 5, 2014 at 9:30 AM, Jack Krupansky j...@basetechnology.com wrote:

To be clear, I myself am not trying to offer advice on whether or when people should upgrade – I’m trying solely to determine whether there is significant value in doing so, and what that value might be. I did indeed read through Robert’s list and have watched the Jira flow over the years, but I am unable to pinpoint “significant” improvements that will have more than just a “minor” impact for users. I’m not trying to say that significant improvements aren’t actually in there, just that I don’t know of any. If I am wrong, please provide the details.
5.0 release status?
I tried to follow all of the trunk 6/branch 5x discussion, but... AFAICT there was no explicit decision or even implication that a release 5.0 would be imminent or that there would not be a 4.11 release. AFAICT, the whole trunk 6/branch 5x decision was more related to wanting to have a trunk that eliminated the 4x deprecations and was no longer constrained by compatibility with the 4x index – let me know if I am wrong about that in any way!

But I did see a comment on one Jira referring to “preparation for a 5.0 release”, so I wanted to inquire about intentions. So, is a 5.0 release “coming soon”, or are 4.11, 4.12, 4.13... equally likely?

AFAICT, there isn’t anything super major in 5x that the world is super-urgently waiting for (WAR vs. server?), and people have been really good at making substantial enhancements in the 4x branch, so I would suggest that anybody strongly favoring an imminent 5.0 release (next six months) should make their case more explicitly. Otherwise, it seems like we can continue to look at an ongoing stream of significant improvements to the 4x branch and that a 5.0 is probably at least a year or so off – or simply waiting on some major change that actually warrants a 5.0.

Open questions: What is Heliosearch up to, and what are Elasticsearch’s intentions?

Comments?

-- Jack Krupansky
Re: 5.0 release status?
On 10/4/2014 10:35 AM, Jack Krupansky wrote:

I tried to follow all of the trunk 6/branch 5x discussion, but... AFAICT there was no explicit decision or even implication that a release 5.0 would be imminent or that there would not be a 4.11 release. AFAICT, the whole trunk 6/branch 5x decision was more related to wanting to have a trunk that eliminated the 4x deprecations and was no longer constrained by compatibility with the 4x index – let me know if I am wrong about that in any way!

I think you're right when you say that freeing trunk from compatibility hell is a primary goal.

In SVN, branch_4x has been eliminated and branch_5x now exists. We took a roundabout path -- if I grok it correctly, branch_4x was renamed to branch_5x and large-scale code changes were backported from trunk. That must have been quite a job, so many thanks to Robert for that effort.

I think that any further 4.x releases will only be point releases for bugfixes on 4.10. We currently don't have an easy way to build a new 4.x release, so the next feature release will be 5.0.

At this moment, branch_5x builds a war, not a server application. I'm still interested in changing that, and I believe that is the plan, but as far as I know, no actual work has been done on the transition. That work is likely to take a while to become stable, so a timely 5.0 release required restoring the war to 5x. I am fairly sure the work for a standalone Solr server will happen on trunk, and if the changes aren't extraordinarily drastic, we can port the alternate build target to 5.x, and make it the default build target in a later release. Since 5.0 will still build a .war file, we probably need to make a servlet version available for all 5.x releases. Stay tuned for info on how that gets managed, because I have no idea. :) Perhaps breaking up the download into smaller bits can happen on the 5x branch.

What I've seen from Heliosearch looks really awesome, though I haven't actually tried it yet. I'd like to see where that goes. GC pauses can be a big problem, so reducing the amount of memory that requires GC is a great goal. For Elasticsearch, I have zero information.

We probably won't get 5.0 out the door before the end of the year, but it would be awesome if we did. Hopefully it won't take six months, though that wouldn't surprise me. I'm doing what I can for the cause, by running a larger test suite than normal. We've got some insane resource requirements for some of our non-default tests! The @Monster designation is fitting.

Thanks,
Shawn

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
Re: 5.0 release status?
The branch_5x effort is to release what would have been 4.11 as 5.0. The most notable reason is backcompat for 3x indexes, which, as Robert has put it, is unmaintainable.

AFAICT, there isn’t anything super major in 5x that the world is super-urgently waiting for (WAR vs. server?)

The WAR removal was not backported to 5x. It is still on trunk, to be dealt with at a later time.

Otherwise, it seems like we can continue to look at an ongoing stream of significant improvements to the 4x branch and that a 5.0 is probably at least a year or so off

I don't believe this is correct. The intent here is to have the next release of Lucene be 5.0. Robert has put in a great deal of effort in making improvements in a new Lucene50 codec that were simply not possible on 4x.

or simply waiting on some major change that actually warrants a 5.0.

There are already some major changes in 5.0: nio2, tons more index corruption protection, super improved debugging for memory allocation of index structures, simpler tokenizer/analyzer interface without Reader, ram usage improvements with the 50 codec work so far. I know I have a list of things I'd like to do API-wise.

IMO, a few months, maybe more.

On Sat, Oct 4, 2014 at 9:35 AM, Jack Krupansky j...@basetechnology.com wrote:

I tried to follow all of the trunk 6/branch 5x discussion, but... AFAICT there was no explicit decision or even implication that a release 5.0 would be imminent or that there would not be a 4.11 release. AFAICT, the whole trunk 6/branch 5x decision was more related to wanting to have a trunk that eliminated the 4x deprecations and was no longer constrained by compatibility with the 4x index – let me know if I am wrong about that in any way!
Re: 5.0 release status?
Thanks for the clarification! I do indeed recall now that portion of the discussion about renaming branch_4x to branch_5x with a lot/most of what had previously been trunk, with the most major exception being the trunk war/server changes.

To make the long story short, the next non-patch release of Lucene and Solr will be... 5.0, not 4.11. So, 5.0 should be out, like within the next couple of months.

In terms of the impact on anybody for compatibility, the only big thing is that 5.0 will not support 3.x indexes. It will fully support the 4.x indexes though, correct?

Will there be any benefit or reason for people to upgrade their 4.x indexes to 5.0? One reason I can think of is so that they will be able to jump from 5.x to 6.0; otherwise 6.0 would refuse to accept their 4.x indexes.

Can a 4.x index be easily upgraded to be a 5.x index, like using a utility or optimize?

Do I have everything straight now?

-- Jack Krupansky

From: Ryan Ernst
Sent: Saturday, October 4, 2014 3:57 PM
To: dev@lucene.apache.org
Subject: Re: 5.0 release status?

The branch_5x effort is to release what would have been 4.11 as 5.0. The most notable reason is backcompat for 3x indexes, which, as Robert has put it, is unmaintainable.
Re: 5.0 release status?
On Sat, Oct 4, 2014 at 12:35 PM, Jack Krupansky j...@basetechnology.com wrote:

I tried to follow all of the trunk 6/branch 5x discussion, but... AFAICT there was no explicit decision or even implication that a release 5.0 would be imminent or that there would not be a 4.11 release. AFAICT, the whole trunk 6/branch 5x decision was more related to wanting to have a trunk that eliminated the 4x deprecations and was no longer constrained by compatibility with the 4x index – let me know if I am wrong about that in any way! But I did see a comment on one Jira referring to “preparation for a 5.0 release”, so I wanted to inquire about intentions. So, is a 5.0 release “coming soon”, or are 4.11, 4.12, 4.13... equally likely?

I created a branch_5x because 3.x index support was responsible for multiple recent corruption bugs, some of which started impacting 4.x indexes. Especially bad were:

LUCENE-5907: 3.x back compat code corrupts (not just can't read) your index.
LUCENE-5934: 3.x back compat code corrupts (not just can't read) your 4.0 index.
LUCENE-5975: 3.x back compat code reports a false corruption (was indeed a bug in those versions of Lucene) for 3.0-3.3 indexes.

Whenever I see patterns in corruptions, I see it as a systemic problem and aggressively work to do something about it. I've seen several lately, but these are the relevant ones:

3.x back compat: 3.x didn't have a codec API, so it's wedged in, and pretty hard. It's not that we were lazy, it's that it's radically different: doesn't separate data by fields, sorts terms differently, uses shared docstores, writes field numbers implicitly, ... We try to emulate it the best we can for testing, but the emulation can't really be perfect, so in such places: surprise, bugs. The only way to stop these corruptions is to stop supporting it.

test infrastructure: IMO Lucene 4 wasn't really ready to support multiple index formats from a test perspective, so we cheated and try to emulate old formats and rotate them across all tests. This works ok, but it's horrible to debug (since these are essentially integration tests), the false failure rate is extremely high, and the complexity of the implementation is high. It's not just that it fails to find some bugs, it was actually directly responsible for corruption bugs like LUCENE-5377. But throughout 4.x, we have fixed the situation and added BaseXYZFormat tests for each part of an index format. Now we have reliable unit tests for each part of the abstract codec API: adding new tests here finds old bugs and prevents new ones in the future. For example, I fixed several minor bugs in 4.x's CFS code just in the last few days with this approach.

There are also other patterns like deleting files, commit fallback logic, exception handling, addIndexes, etc. that we have put substantial work into recently for 5.0. Whatever was safe to backport to bugfix releases, we tried, but some of these kinds of fixes are just too heavy for a bugfix branch, and many just cannot even be done as long as 3.x support exists. There is also some hardening in the 5.0 index format itself that really could not happen correctly as long as we must support 3.x.

So it's not just that 3.x causes corruption bugs, it prevents us from moving forward and actually tackling these other issues. This is important to do or we will just continue to tread water and not actually get ahead of them.

So I did something about it and created a 5.x branch. Worst case, nobody would follow along, but I guess I just assumed the situation was widely understood.

Open questions: What is Heliosearch up to, and what are Elasticsearch’s intentions?

I don't see how this is relevant. The straw that broke the camel's back for me was LUCENE-5934, and it doesn't impact Elasticsearch.
Re: 5.0 release status?
Thanks for the further clarification. In short, the legacy of 3.x support was destabilizing 4.x itself (including testing), not just interfering with 6.x moving forward beyond 3.x index compatibility. So, 5.x will have less baggage holding it down than 4.x has today.

I still need answers to:

1. Will users of 5.0 get any immediate benefit by reindexing or otherwise upgrading their 4.x indexes to 5.0?

2. What is the easiest, most efficient way for users of 5.0 to upgrade their 4.x indexes to 5.0 so that they will not have to worry or do anything when 6.0 comes out?

-- Jack Krupansky

-Original Message-
From: Robert Muir
Sent: Saturday, October 4, 2014 10:43 PM
To: dev@lucene.apache.org
Subject: Re: 5.0 release status?

I created a branch_5x because 3.x index support was responsible for multiple recent corruption bugs, some of which started impacting 4.x indexes.
Re: 5.0 release status?
On Oct 4, 2014 9:13 PM, Jack Krupansky j...@basetechnology.com wrote:

Thanks for the further clarification. In short, the legacy of 3.x support was destabilizing 4.x itself (including testing), not just interfering with 6.x moving forward beyond 3.x index compatibility. So, 5.x will have less baggage holding it down than 4.x has today. I still need answers to:

1. Will users of 5.0 get any immediate benefit by reindexing or otherwise upgrading their 4.x indexes to 5.0?

Yes, for all the reasons Robert already mentioned.

2. What is the easiest, most efficient way for users of 5.0 to upgrade their 4.x indexes to 5.0 so that they will not have to worry or do anything when 6.0 comes out?

Again, users should always upgrade if possible. There are improvements for memory and speed all the time. Currently they can use the IndexUpgrader (offline) or wrap their merge policy with UpgradeIndexMergePolicy (although both currently act like an optimize on the old segments; I'm hoping to change that soon).

Ryan
Re: 5.0 release status?
Maybe I just can’t fully make sense of LUCENE-5934 – does it corrupt all 4.x indexes, or some, or under some conditions? I mean, I had the impression that it was only non-GA 4.0 indexes. And was it only 4.10 that was doing this, or 4.0 GA through 4.9 as well? In any case, I’m still not clear on the direct benefits to users of, say, 4.9 upgrading to 5.0 indexes. Any performance improvement? Any disk space reduction? Any RAM reduction? -- Jack Krupansky
Re: 5.0 release status?
On Oct 4, 2014 9:35 PM, Jack Krupansky j...@basetechnology.com wrote: Maybe I just can’t fully make sense of LUCENE-5934 – does it corrupt all 4.x indexes, or some, or under some conditions? I mean, I had the impression that it was only non-GA 4.0 indexes. And was it only 4.10 that was doing this, or 4.0 GA through 4.9 as well? The bug only affected people using the 4.10.0 release to read 4.0 beta/final segments (it thought they were 3x indexes). In any case, I’m still not clear on the direct benefits to users of, say, 4.9 upgrading to 5.0 indexes. Any performance improvement? Any disk space reduction? Any RAM reduction? Again, read through all the stuff Robert has mentioned, read through lucene/CHANGES.txt, read the issues that are currently open. Your previous comments have suggested users upgrading to 5.0 would only do so so they can eventually upgrade to 6.0, implying they wouldn't upgrade their indexes for minor releases. This simply is not the best advice. Look back at 4.9 and 4.10 for recent improvements in heap usage for doc values and norms for example. Going back farther, someone still on 4.0 doesn't benefit from the postings format improvements in 4.1. Users should upgrade their format whenever possible because improvements are always happening.