Re: 5.0 release status?

2014-10-05 Thread Jack Krupansky
To be clear, I myself am not trying to offer advice on whether or when people 
should upgrade – I’m trying solely to determine if there is significant value 
to do so, and what that value might be. I did indeed read through Robert’s list 
and have watched the Jira flow over the years, but I am unable to pinpoint 
“significant” improvements that will have more than just a “minor” impact for 
users. I’m not trying to say that significant improvements aren’t actually in 
there, just that I don’t know of any. If I am wrong, please provide the 
details. Like... are there use cases where the 5.0 index will be at least 10% 
faster or at least 10% smaller, and if so, which specific features and use 
cases? Or if there is a cumulative improvement in performance or capacity.

Or... if there are specific feature transitions to recommend that would result 
in dramatic improvements.

I mean, as things stand, there has been a lot of “shuffling around”, but no 
clear, quantified insight on the benefits of that shuffling/refactoring. I’m 
all for cleaner code (which can manifest as more reliable and fewer bugs), but 
is that the gist of most of the index changes?

In short, I’m more interested in the impact of the 5.0 index changes (and their 
use cases), not the details of the implementation of those changes.

Put another way, will a typical app be at least 10% faster or 10% smaller (or 
both!) when its index is converted from 4.x to 5.0? Or 5% or 20% or... whatever 
it actually is?

And if there are specific new features that rely on conversion to 5.0 index 
format, let's get that list collected as some bullet points. Call this 
preparation for the 5.0 release! Maybe it could be a summary section in the 5.0 
migration guide.

Clearly there is plenty of goodness in the 5.0 work, but I’m just trying to get 
a handle on the overall impact.

-- Jack Krupansky

From: Ryan Ernst 
Sent: Sunday, October 5, 2014 12:48 AM
To: dev@lucene.apache.org 
Subject: Re: 5.0 release status?


On Oct 4, 2014 9:35 PM, Jack Krupansky j...@basetechnology.com wrote:

 Maybe I just can’t fully make sense of LUCENE-5934 – does it corrupt all 4.x 
 indexes, or some, or under some conditions? I mean, I had the impression that 
 it was only non-GA 4.0 indexes. And was it only 4.10 that was doing this, or 
 4.0 GA through 4.9 as well?

The bug only affected people using the 4.10.0 release to read 4.0 beta/final 
segments (it thought they were 3x indexes).

  
 In any case, I’m still not clear on the direct benefits to users of, say, 4.9 
 upgrading to 5.0 indexes. Any performance improvement? Any disk space 
 reduction? Any RAM reduction?

Again, read through all the stuff Robert has mentioned, read through 
lucene/CHANGES.txt, read the issues that are currently open. Your previous 
comments have suggested users upgrading to 5.0 would only do so so they can 
eventually upgrade to 6.0, implying they wouldn't upgrade their indexes for 
minor releases. This simply is not the best advice. Look back at 4.9 and 4.10 
for recent improvements in heap usage for doc values and norms for example. 
Going back farther, someone still on 4.0 doesn't benefit from the postings 
format improvements in 4.1. Users should upgrade their format whenever possible 
because improvements are always happening.

  

Re: 5.0 release status?

2014-10-05 Thread Robert Muir
Why don't you just do as Ryan suggests and read the JIRA issues? I
already outlined things like the memory improvements being made in the
new format and for merging. I don't need to summarize it again.

And we don't have to justify fixing back compat corruptions with new
features. It is the other way around. If anyone doesn't like this
approach to unfucking these problems for 5.0 and wants to continue
with 4.x releases, then they need to step up to the plate and do the
work.

That's where the problem always is: lots of people that want to whine
about back compat, but don't want to actually do any work.


5.0 release status?

2014-10-04 Thread Jack Krupansky
I tried to follow all of the trunk 6/branch 5x discussion, but... AFAICT there 
was no explicit decision or even implication that a release 5.0 would be 
imminent or that there would not be a 4.11 release. AFAICT, the whole trunk 
6/branch 5x decision was more related to wanting to have a trunk that 
eliminated the 4x deprecations and was no longer constrained by compatibility 
with the 4x index – let me know if I am wrong about that in any way! But I did 
see a comment on one Jira referring to “preparation for a 5.0 release”, so I 
wanted to inquire about intentions. So, is a 5.0 release “coming soon”, or are 
4.11, 4.12, 4.13... equally likely?

AFAICT, there isn’t anything super major in 5x that the world is super-urgently 
waiting for (WAR vs. server?), and people have been really good at making 
substantial enhancements in the 4x branch, so I would suggest that anybody 
strongly favoring an imminent 5.0 release (next six months) should make their 
case more explicitly. Otherwise, it seems like we can continue to look at an 
ongoing stream of significant improvements to the 4x branch and that a 5.0 is 
probably at least a year or so off – or simply waiting on some major change 
that actually warrants a 5.0.

Open questions: What is Heliosearch up to, and what are Elasticsearch’s 
intentions?

Comments?

-- Jack Krupansky

Re: 5.0 release status?

2014-10-04 Thread Shawn Heisey
On 10/4/2014 10:35 AM, Jack Krupansky wrote:
 I tried to follow all of the trunk 6/branch 5x discussion, but... AFAICT
 there was no explicit decision or even implication that a release 5.0
 would be imminent or that there would not be a 4.11 release. AFAICT, the
 whole trunk 6/branch 5x decision was more related to wanting to have a
 trunk that eliminated the 4x deprecations and was no longer constrained
 by compatibility with the 4x index – let me know if I am wrong about
 that in any way! But I did see a comment on one Jira referring to
 “preparation for a 5.0 release”, so I wanted to inquire about
 intentions. So, is a 5.0 release “coming soon”, or are 4.11, 4.12,
 4.13... equally likely?
  
 AFAICT, there isn’t anything super major in 5x that the world is
 super-urgently waiting for (WAR vs. server?), and people have been
 really good at making substantial enhancements in the 4x branch, so I
 would suggest that anybody strongly favoring an imminent 5.0 release
 (next six months) should make their case more explicitly. Otherwise, it
 seems like we can continue to look at an ongoing stream of significant
 improvements to the 4x branch and that a 5.0 is probably at least a year
 or so off – or simply waiting on some major change that actually
 warrants a 5.0.
  
 Open questions: What is Heliosearch up to, and what are Elasticsearch’s
 intentions?

I think you're right when you say that freeing trunk from compatibility
hell is a primary goal.

In SVN, branch_4x has been eliminated and branch_5x now exists.  We took
a roundabout path -- if I grok it correctly, branch_4x was renamed to
branch_5x and large-scale code changes were backported from trunk.  That
must have been quite a job, so many thanks to Robert for that effort.

I think that any further 4.x releases will only be point releases for
bugfixes on 4.10.  We currently don't have an easy way to build a new
4.x release, so the next feature release will be 5.0.

At this moment, branch_5x builds a war, not a server application.  I'm
still interested in changing that, and I believe that is the plan, but
as far as I know, no actual work has been done on the transition.  That
work is likely to take a while to become stable, so a timely 5.0 release
required restoring the war to 5x.

I am fairly sure the work for a standalone Solr server will happen on
trunk, and if the changes aren't extraordinarily drastic, we can port
the alternate build target to 5.x, and make it the default build target
in a later release.  Since 5.0 will still build a .war file, we probably
need to make a servlet version available for all 5.x releases.  Stay
tuned for info on how that gets managed, because I have no idea. :)
Perhaps breaking up the download into smaller bits can happen on the 5x
branch.

What I've seen from Heliosearch looks really awesome, though I haven't
actually tried it yet.  I'd like to see where that goes.  GC pauses can
be a big problem, so reducing the amount of memory that requires GC is a
great goal.  For elasticsearch, I have zero information.

We probably won't get 5.0 out the door before the end of the year, but
it would be awesome if we did.  Hopefully it won't take six months,
though that wouldn't surprise me.  I'm doing what I can for the cause,
by running a larger test suite than normal.  We've got some insane
resource requirements for some of our non-default tests!  The @Monster
designation is fitting.

Thanks,
Shawn


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: 5.0 release status?

2014-10-04 Thread Ryan Ernst
The branch_5x effort is to release what would have been 4.11 as 5.0.  The
most notable reason being backcompat for 3x indexes, which as Robert has
put it is unmaintainable.

 AFAICT, there isn’t anything super major in 5x that the world is
 super-urgently waiting for (WAR vs. server?)


The WAR removal was not backported to 5x.  It is still on trunk, to be
dealt with at a later time.

 Otherwise, it seems like we can continue to look at an ongoing stream of
 significant improvements to the 4x branch and that a 5.0 is probably at
 least a year or so off


I don't believe this is correct.  The intent here is to have the next
release of Lucene be 5.0.  Robert has put in a great deal of effort in
making improvements in a new Lucene50 codec that were simply not possible
on 4x.

 or simply waiting on some major change that actually warrants a 5.0.


There are already some major changes in 5.0: nio2, tons more index
corruption protection, super improved debugging for memory allocation of
index structures, simpler tokenizer/analyzer interface without Reader, ram
usage improvements with the 50 codec work so far.

I know I have a list of things I'd like to do API-wise. IMO, 5.0 is a few
months out, maybe more.




Re: 5.0 release status?

2014-10-04 Thread Jack Krupansky
Thanks for the clarification! I do indeed recall now that portion of the 
discussion about renaming of branch_4x to branch_5x with a lot/most of what had 
previously been trunk, with the most major exception being the trunk war/server 
changes.

To make the long story short, the next non-patch release of Lucene and Solr 
will be... 5.0, not 4.11. So, 5.0 should be out, like within the next couple of 
months.

In terms of the impact on anybody for compatibility, the only big thing is that 
5.0 will not support 3.x indexes. It will fully support the 4.x indexes though, 
correct? Will there be any benefit or reason for people to upgrade their 4.x 
indexes to 5.0? One reason I can think of is so that they will be able to jump 
from 5.x to 6.0, otherwise 6.0 would refuse to accept their 4.x indexes. Can a 
4.x index be easily upgraded to be a 5.x index, like using a utility or 
optimize?

Do I have everything straight now?

-- Jack Krupansky



Re: 5.0 release status?

2014-10-04 Thread Robert Muir
On Sat, Oct 4, 2014 at 12:35 PM, Jack Krupansky j...@basetechnology.com wrote:
 I tried to follow all of the trunk 6/branch 5x discussion, but... AFAICT
 there was no explicit decision or even implication that a release 5.0 would
 be imminent or that there would not be a 4.11 release. AFAICT, the whole
 trunk 6/branch 5x decision was more related to wanting to have a trunk that
 eliminated the 4x deprecations and was no longer constrained by
 compatibility with the 4x index – let me know if I am wrong about that in
 any way! But I did see a comment on one Jira referring to “preparation for a
 5.0 release”, so I wanted to inquire about intentions. So, is a 5.0 release
 “coming soon”, or are 4.11, 4.12, 4.13... equally likely?

I created a branch_5x because 3.x index support was responsible for
multiple recent corruption bugs, some of which started impacting 4.x
indexes.

Especially bad were:
LUCENE-5907: 3.x back compat code corrupts (not just can't read) your index.
LUCENE-5934: 3.x back compat code corrupts (not just can't read) your 4.0 index.
LUCENE-5975: 3.x back compat code reports a false corruption (was
indeed a bug in those versions of lucene) for 3.0-3.3 indexes.

Whenever I see patterns in corruptions then I see it as a systemic
problem and aggressively work to do something about it. I've seen
several lately, but these are the relevant ones:

3.x back compat: 3.x didn't have a codec API, so it's wedged in, and
pretty hard. It's not that we were lazy, it's that it's radically
different: doesn't separate data by fields, sorts terms differently,
uses shared docstores, writes field numbers implicitly, ... We try to
emulate it the best we can for testing, but the emulation can't really
be perfect, so in such places: surprise, bugs. The only way to stop
these corruptions is to stop supporting it.
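
As an illustration of the "sorts terms differently" point: the sketch below assumes (this detail is not stated in the thread) that 3.x ordered terms by UTF-16 code unit while 4.x orders them by UTF-8 byte, which is the same as Unicode code point order. The class and method names are made up for the example and are not Lucene APIs. The two orders disagree as soon as a supplementary character meets a BMP character above the surrogate range.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class TermOrderDemo {
    // Compare by UTF-16 code unit; this is what String.compareTo does.
    static int utf16Compare(String a, String b) {
        return a.compareTo(b);
    }

    // Compare by unsigned UTF-8 byte, which matches Unicode code point order.
    static int utf8Compare(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        String bmp = "\uFB01";         // U+FB01, a single UTF-16 code unit
        String supp = "\uD835\uDC00";  // U+1D400, encoded as a surrogate pair

        String[] terms = {bmp, supp};
        Arrays.sort(terms, TermOrderDemo::utf16Compare);
        System.out.println("UTF-16 order, U+1D400 first: " + terms[0].equals(supp));

        Arrays.sort(terms, TermOrderDemo::utf8Compare);
        System.out.println("UTF-8 order, U+FB01 first: " + terms[0].equals(bmp));
    }
}
```

Any code that reads both kinds of index has to reconcile these two orders somewhere, which is one place where an emulation layer can go wrong.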

test infrastructure: IMO lucene 4 wasn't really ready to support
multiple index formats from a test perspective, so we cheated and try
to emulate old formats and rotate them across all tests. This works
ok, but it's horrible to debug (since these are essentially
integration tests), the false failure rate is extremely high, and the
complexity of the implementation is high. It's not just that it fails
to find some bugs, it was actually directly
responsible for corruption bugs like LUCENE-5377. But throughout 4.x,
we have fixed the situation and added BaseXYZFormat tests for each
part of an index format. Now we have reliable unit tests for each part
of the abstract codec API: adding new tests here finds old bugs and
prevents new ones in the future. For example I fixed several minor
bugs in 4.x's CFS code just the last few days with this approach.
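
A rough sketch of the "base test per part of the format" idea, under stated assumptions: ToyFormat, BaseToyFormatTestCase and the two formats below are hypothetical stand-ins, not the real BaseXYZFormat classes or any Lucene type. The shape is what matters: the shared checks live in one abstract base, and each format only plugs itself in.

```java
import java.nio.charset.StandardCharsets;

final class FormatTestsDemo {
    public static void main(String[] args) {
        new TestPlainFormat().testRoundTrip();
        new TestLengthPrefixedFormat().testRoundTrip();
        System.out.println("all format round-trip tests passed");
    }
}

// Hypothetical stand-in for one part of an index format (not a Lucene type).
interface ToyFormat {
    byte[] write(String value);
    String read(byte[] data);
}

// Shared round-trip check; a concrete test only supplies the format under test.
abstract class BaseToyFormatTestCase {
    abstract ToyFormat getFormat();

    void testRoundTrip() {
        ToyFormat format = getFormat();
        for (String value : new String[] {"", "lucene", "caf\u00e9"}) {
            String back = format.read(format.write(value));
            if (!value.equals(back)) {
                throw new AssertionError("round trip failed for: " + value);
            }
        }
    }
}

// Format A: raw UTF-8 bytes.
final class PlainFormat implements ToyFormat {
    public byte[] write(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    public String read(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }
}

// Format B: one length byte, then UTF-8 bytes (values under 256 bytes only).
final class LengthPrefixedFormat implements ToyFormat {
    public byte[] write(String value) {
        byte[] body = value.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[body.length + 1];
        out[0] = (byte) body.length;
        System.arraycopy(body, 0, out, 1, body.length);
        return out;
    }

    public String read(byte[] data) {
        int length = data[0] & 0xFF;
        return new String(data, 1, length, StandardCharsets.UTF_8);
    }
}

final class TestPlainFormat extends BaseToyFormatTestCase {
    ToyFormat getFormat() { return new PlainFormat(); }
}

final class TestLengthPrefixedFormat extends BaseToyFormatTestCase {
    ToyFormat getFormat() { return new LengthPrefixedFormat(); }
}
```

Adding a new check to the base class runs it against every format at once, old and new, which fits the remark that new tests here find old bugs.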

there are also other patterns like deleting files, commit fallback
logic, exception handling, addIndexes, etc that we have put
substantial work into recently for 5.0. Whatever was safe to backport
to bugfix releases, we tried, but some of these kinds of fixes are
just too heavy for a bugfix branch, and many just cannot even be done
as long as 3.x support exists. There is also some hardening in the 5.0
index format itself that really could not happen correctly as long as
we must support 3.x.

So it's not just that 3.x causes corruption bugs, it prevents us from
moving forward and actually tackling these other issues. This is
important to do or we will just continue to tread water and not
actually get ahead of them. So I did something about it and created a
5.x branch. Worst case, nobody would follow along, but I guess I just
assumed the situation was widely understood.


 Open questions: What is Heliosearch up to, and what are Elasticsearch’s
 intentions?


I don't see how this is relevant. The straw that broke the camel's back
for me was LUCENE-5934, and it doesn't impact elasticsearch.




Re: 5.0 release status?

2014-10-04 Thread Jack Krupansky
Thanks for the further clarification. In short, the legacy of 3.x support 
was destabilizing 4.x itself (including testing), not just interfering with 
6.x moving forward beyond 3.x index compatibility. So, 5.x will have less 
baggage holding it down than 4.x has today.


I still need answers to:

1. Will users of 5.0 get any immediate benefit by reindexing or otherwise 
upgrading their 4.x indexes to 5.0?


2. What is the easiest, most efficient way for users of 5.0 to upgrade their 
4.x indexes to 5.0 so that they will not have to worry or do anything when 
6.0 comes out?


-- Jack Krupansky


Re: 5.0 release status?

2014-10-04 Thread Ryan Ernst
On Oct 4, 2014 9:13 PM, Jack Krupansky j...@basetechnology.com wrote:

 Thanks for the further clarification. In short, the legacy of 3.x support
 was destabilizing 4.x itself (including testing), not just interfering with
 6.x moving forward beyond 3.x index compatibility. So, 5.x will have less
 baggage holding it down than 4.x has today.

 I still need answers to:

 1. Will users of 5.0 get any immediate benefit by reindexing or otherwise
 upgrading their 4.x indexes to 5.0?

Yes, for all the reasons Robert already mentioned.


 2. What is the easiest, most efficient way for users of 5.0 to upgrade
 their 4.x indexes to 5.0 so that they will not have to worry or do anything
 when 6.0 comes out?

Again, users should always upgrade if possible. There are improvements for
memory and speed all the time. Currently they can use the IndexUpgrader
(offline) or wrap their merge policy with UpgradeIndexMergePolicy (although
both currently act like an optimize on the old segments; I'm hoping to
change that soon).

Ryan
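
To make the "act like an optimize on the old segments" behaviour concrete, here is a toy model of it. It does not call IndexUpgrader or UpgradeIndexMergePolicy; the Segment class, the upgrade method and the version numbers are all hypothetical. The only assumption is the behaviour described above: every segment written in an older format is rewritten, merged together, into the current format, while segments already in the current format are left alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class UpgradeSketch {
    // Hypothetical segment model: a format major version and a document count.
    static final class Segment {
        final int formatVersion;
        final int docCount;

        Segment(int formatVersion, int docCount) {
            this.formatVersion = formatVersion;
            this.docCount = docCount;
        }
    }

    // Merges all segments older than currentVersion into one segment in the
    // current format; segments that are already current pass through unchanged.
    static List<Segment> upgrade(List<Segment> segments, int currentVersion) {
        List<Segment> result = new ArrayList<>();
        int mergedDocs = 0;
        boolean sawOld = false;
        for (Segment s : segments) {
            if (s.formatVersion < currentVersion) {
                sawOld = true;
                mergedDocs += s.docCount;
            } else {
                result.add(s);
            }
        }
        if (sawOld) {
            result.add(new Segment(currentVersion, mergedDocs));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Segment> index = Arrays.asList(
            new Segment(4, 10), new Segment(5, 7), new Segment(4, 3));
        for (Segment s : upgrade(index, 5)) {
            System.out.println("format " + s.formatVersion + ", docs " + s.docCount);
        }
    }
}
```

The cost shows up in the merged segment: all old data is rewritten in one pass, which is why an upgrade that avoids the optimize-like merge would be cheaper on large indexes.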


 -- Jack Krupansky

 -Original Message- From: Robert Muir
 Sent: Saturday, October 4, 2014 10:43 PM

 To: dev@lucene.apache.org
 Subject: Re: 5.0 release status?

 On Sat, Oct 4, 2014 at 12:35 PM, Jack Krupansky j...@basetechnology.com
wrote:

 I tried to follow all of the trunk 6/branch 5x discussion, but... AFAICT
 there was no explicit decision or even implication that a release 5.0 would
 be imminent or that there would not be a 4.11 release. AFAICT, the whole
 trunk 6/branch 5x decision was more related to wanting to have a trunk that
 eliminated the 4x deprecations and was no longer constrained by
 compatibility with the 4x index – let me know if I am wrong about that in
 any way! But I did see a comment on one Jira referring to “preparation for a
 5.0 release”, so I wanted to inquire about intentions. So, is a 5.0 release
 “coming soon”, or are 4.11, 4.12, 4.13... equally likely?


 I created a branch_5x because 3.x index support was responsible for
 multiple recent corruption bugs, some of which started impacting 4.x
 indexes.

 Especially bad were:
 LUCENE-5907: 3.x back compat code corrupts (not just can't read) your index.
 LUCENE-5934: 3.x back compat code corrupts (not just can't read) your 4.0
 index.
 LUCENE-5975: 3.x back compat code reports a false corruption (was
 indeed a bug in those versions of Lucene) for 3.0-3.3 indexes.

 Whenever I see patterns in corruptions then I see it as a systemic
 problem and aggressively work to do something about it. I've seen
 several lately, but these are the relevant ones:

 3.x back compat: 3.x didn't have a codec API, so it's wedged in, and
 pretty hard. It's not that we were lazy, it's that it's radically
 different: it doesn't separate data by fields, sorts terms differently,
 uses shared docstores, writes field numbers implicitly, ... We try to
 emulate it the best we can for testing, but the emulation can't really
 be perfect, so in such places: surprise, bugs. The only way to stop
 these corruptions is to stop supporting it.

 test infrastructure: IMO Lucene 4 wasn't really ready to support
 multiple index formats from a test perspective, so we cheated and try
 to emulate old formats and rotate them across all tests. This works
 ok, but it's horrible to debug (since these are essentially
 integration tests), the false failure rate is extremely high, and the
 complexity of the implementation is high. It's not just that it fails
 to find some bugs, it was actually directly responsible for corruption
 bugs like LUCENE-5377. But throughout 4.x, we have fixed the situation
 and added BaseXYZFormat tests for each part of an index format. Now we
 have reliable unit tests for each part of the abstract codec API:
 adding new tests here finds old bugs and prevents new ones in the
 future. For example, I fixed several minor bugs in 4.x's CFS code in
 just the last few days with this approach.

 there are also other patterns like deleting files, commit fallback
 logic, exception handling, addIndexes, etc that we have put
 substantial work into recently for 5.0. Whatever was safe to backport
 to bugfix releases, we tried, but some of these kinds of fixes are
 just too heavy for a bugfix branch, and many just cannot even be done
 as long as 3.x support exists. There is also some hardening in the 5.0
 index format itself that really could not happen correctly as long as
 we must support 3.x.

 So it's not just that 3.x causes corruption bugs, it prevents us from
 moving forward and actually tackling these other issues. This is
 important to do or we will just continue to tread water and not
 actually get ahead of them. So I did something about it and created a
 5.x branch. Worst case, nobody would follow along, but I guess I just
 assumed the situation was widely understood.


 Open questions: What is Heliosearch up to, and what are Elasticsearch’s
 intentions?


 I don't see how this is relevant. The straw that broke the camel's back
 for me was LUCENE-5934

Re: 5.0 release status?

2014-10-04 Thread Jack Krupansky
Maybe I just can’t fully make sense of LUCENE-5934 – does it corrupt all 4.x 
indexes, or some, or under some conditions? I mean, I had the impression that 
it was only non-GA 4.0 indexes. And was it only 4.10 that was doing this, or 
4.0 GA through 4.9 as well?

In any case, I’m still not clear on the direct benefits to users of, say, 4.9 
upgrading to 5.0 indexes. Any performance improvement? Any disk space 
reduction? Any RAM reduction?

-- Jack Krupansky

From: Ryan Ernst 
Sent: Sunday, October 5, 2014 12:24 AM
To: dev@lucene.apache.org 
Subject: Re: 5.0 release status?


On Oct 4, 2014 9:13 PM, Jack Krupansky j...@basetechnology.com wrote:

 Thanks for the further clarification. In short, the legacy of 3.x support was 
 destabilizing 4.x itself (including testing), not just interfering with 6.x 
 moving forward beyond 3.x index compatibility. So, 5.x will have less baggage 
 holding it down than 4.x has today.

 I still need answers to:

 1. Will users of 5.0 get any immediate benefit by reindexing or otherwise 
 upgrading their 4.x indexes to 5.0?

Yes, for all the reasons Robert already mentioned.


 2. What is the easiest, most efficient way for users of 5.0 to upgrade their 
 4.x indexes to 5.0 so that they will not have to worry or do anything when 
 6.0 comes out?

Again, users should always upgrade if possible. There are improvements for
memory and speed all the time. Currently they can use the IndexUpgrader
(offline) or wrap their merge policy with UpgradeIndexMergePolicy (although
both currently act like an optimize on the old segments; I'm hoping to change
that soon).

Ryan


 -- Jack Krupansky

 -Original Message- From: Robert Muir
 Sent: Saturday, October 4, 2014 10:43 PM

 To: dev@lucene.apache.org
 Subject: Re: 5.0 release status?

 On Sat, Oct 4, 2014 at 12:35 PM, Jack Krupansky j...@basetechnology.com 
 wrote:

 I tried to follow all of the trunk 6/branch 5x discussion, but... AFAICT
 there was no explicit decision or even implication that a release 5.0 would
 be imminent or that there would not be a 4.11 release. AFAICT, the whole
 trunk 6/branch 5x decision was more related to wanting to have a trunk that
 eliminated the 4x deprecations and was no longer constrained by
 compatibility with the 4x index – let me know if I am wrong about that in
 any way! But I did see a comment on one Jira referring to “preparation for a
 5.0 release”, so I wanted to inquire about intentions. So, is a 5.0 release
 “coming soon”, or are 4.11, 4.12, 4.13... equally likely?


 I created a branch_5x because 3.x index support was responsible for
 multiple recent corruption bugs, some of which started impacting 4.x
 indexes.

 Especially bad were:
 LUCENE-5907: 3.x back compat code corrupts (not just can't read) your index.
 LUCENE-5934: 3.x back compat code corrupts (not just can't read) your 4.0 
 index.
 LUCENE-5975: 3.x back compat code reports a false corruption (was
 indeed a bug in those versions of lucene) for 3.0-3.3 indexes.

 Whenever I see patterns in corruptions then I see it as a systemic
 problem and aggressively work to do something about it. I've seen
 several lately, but these are the relevant ones:

 3.x back compat: 3.x didn't have a codec API, so it's wedged in, and
 pretty hard. It's not that we were lazy, it's that it's radically
 different: it doesn't separate data by fields, sorts terms differently,
 uses shared docstores, writes field numbers implicitly, ... We try to
 emulate it the best we can for testing, but the emulation can't really
 be perfect, so in such places: surprise, bugs. The only way to stop
 these corruptions is to stop supporting it.

 test infrastructure: IMO Lucene 4 wasn't really ready to support
 multiple index formats from a test perspective, so we cheated and try
 to emulate old formats and rotate them across all tests. This works
 ok, but it's horrible to debug (since these are essentially
 integration tests), the false failure rate is extremely high, and the
 complexity of the implementation is high. It's not just that it fails
 to find some bugs, it was actually directly responsible for corruption
 bugs like LUCENE-5377. But throughout 4.x, we have fixed the situation
 and added BaseXYZFormat tests for each part of an index format. Now we
 have reliable unit tests for each part of the abstract codec API:
 adding new tests here finds old bugs and prevents new ones in the
 future. For example, I fixed several minor bugs in 4.x's CFS code in
 just the last few days with this approach.

 there are also other patterns like deleting files, commit fallback
 logic, exception handling, addIndexes, etc that we have put
 substantial work into recently for 5.0. Whatever was safe to backport
 to bugfix releases, we tried, but some of these kinds of fixes are
 just too heavy for a bugfix branch, and many just cannot even be done
 as long as 3.x support exists. There is also some hardening in the 5.0
 index format itself that really could not happen correctly as long

Re: 5.0 release status?

2014-10-04 Thread Ryan Ernst
On Oct 4, 2014 9:35 PM, Jack Krupansky j...@basetechnology.com wrote:

 Maybe I just can’t fully make sense of LUCENE-5934 – does it corrupt all
4.x indexes, or some, or under some conditions? I mean, I had the
impression that it was only non-GA 4.0 indexes. And was it only 4.10 that
was doing this, or 4.0 GA through 4.9 as well?

The bug only affected people using the 4.10.0 release to read 4.0
beta/final segments (it thought they were 3.x indexes).


 In any case, I’m still not clear on the direct benefits to users of, say,
4.9 upgrading to 5.0 indexes. Any performance improvement? Any disk space
reduction? Any RAM reduction?

Again, read through all the stuff Robert has mentioned, read through
lucene/CHANGES.txt, and read the issues that are currently open. Your previous
comments have suggested that users upgrading to 5.0 would do so only so that
they can eventually upgrade to 6.0, implying they wouldn't upgrade their
indexes for minor releases. This simply is not the best advice. Look back at
4.9 and 4.10 for recent improvements in heap usage for doc values and norms,
for example. Going back further, someone still on 4.0 doesn't benefit from the
postings format improvements in 4.1. Users should upgrade their format
whenever possible because improvements are always happening.


 -- Jack Krupansky

 From: Ryan Ernst
 Sent: Sunday, October 5, 2014 12:24 AM
 To: dev@lucene.apache.org
 Subject: Re: 5.0 release status?



 On Oct 4, 2014 9:13 PM, Jack Krupansky j...@basetechnology.com wrote:
 
  Thanks for the further clarification. In short, the legacy of 3.x
support was destabilizing 4.x itself (including testing), not just
interfering with 6.x moving forward beyond 3.x index compatibility. So, 5.x
will have less baggage holding it down than 4.x has today.
 
  I still need answers to:
 
  1. Will users of 5.0 get any immediate benefit by reindexing or
otherwise upgrading their 4.x indexes to 5.0?

 Yes, for all the reasons Robert already mentioned.

 
  2. What is the easiest, most efficient way for users of 5.0 to upgrade
their 4.x indexes to 5.0 so that they will not have to worry or do anything
when 6.0 comes out?

 Again, users should always upgrade if possible. There are improvements
for memory and speed all the time. Currently they can use the IndexUpgrader
(offline) or wrap their merge policy with UpgradeIndexMergePolicy (although
both currently act like an optimize on the old segments; I'm hoping to
change that soon).

 Ryan

 
  -- Jack Krupansky
 
  -Original Message- From: Robert Muir
  Sent: Saturday, October 4, 2014 10:43 PM
 
  To: dev@lucene.apache.org
  Subject: Re: 5.0 release status?
 
  On Sat, Oct 4, 2014 at 12:35 PM, Jack Krupansky j...@basetechnology.com wrote:

  I tried to follow all of the trunk 6/branch 5x discussion, but... AFAICT
  there was no explicit decision or even implication that a release 5.0 would
  be imminent or that there would not be a 4.11 release. AFAICT, the whole
  trunk 6/branch 5x decision was more related to wanting to have a trunk that
  eliminated the 4x deprecations and was no longer constrained by
  compatibility with the 4x index – let me know if I am wrong about that in
  any way! But I did see a comment on one Jira referring to “preparation for a
  5.0 release”, so I wanted to inquire about intentions. So, is a 5.0 release
  “coming soon”, or are 4.11, 4.12, 4.13... equally likely?
 
 
  I created a branch_5x because 3.x index support was responsible for
  multiple recent corruption bugs, some of which started impacting 4.x
  indexes.
 
  Especially bad were:
  LUCENE-5907: 3.x back compat code corrupts (not just can't read) your index.
  LUCENE-5934: 3.x back compat code corrupts (not just can't read) your 4.0
  index.
  LUCENE-5975: 3.x back compat code reports a false corruption (was
  indeed a bug in those versions of lucene) for 3.0-3.3 indexes.
 
  Whenever I see patterns in corruptions then I see it as a systemic
  problem and aggressively work to do something about it. I've seen
  several lately, but these are the relevant ones:
 
  3.x back compat: 3.x didn't have a codec API, so it's wedged in, and
  pretty hard. It's not that we were lazy, it's that it's radically
  different: it doesn't separate data by fields, sorts terms differently,
  uses shared docstores, writes field numbers implicitly, ... We try to
  emulate it the best we can for testing, but the emulation can't really
  be perfect, so in such places: surprise, bugs. The only way to stop
  these corruptions is to stop supporting it.
 
  test infrastructure: IMO Lucene 4 wasn't really ready to support
  multiple index formats from a test perspective, so we cheated and try
  to emulate old formats and rotate them across all tests. This works
  ok, but it's horrible to debug (since these are essentially
  integration tests), the false failure rate is extremely high, and the
  complexity of the implementation is high. It's not just that it fails
  to find some bugs, it was actually directly