+1, this sounds like a great solution. It simplifies the APIs (no more required Version to Analyzer), consolidates the version logic into a "single source", and makes dot releases first class.
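
To make the shape of the proposal concrete, here is a rough sketch of a single version class with first-class bug fix releases, plus an analyzer that takes no Version in its constructor and only consults one if a setter was called. This is not actual Lucene code: SketchVersion, SketchAnalyzer, LATEST, CHANGED_IN, onOrAfter and the "lowercasing changed in 4.8.0" behavior are all hypothetical names and assumptions invented for illustration.

```java
import java.util.Locale;
import java.util.Objects;

// Hypothetical stand-in for a consolidated Version class (not the real Lucene API).
// One place owns parsing, comparison, and the "current" version.
final class SketchVersion implements Comparable<SketchVersion> {
    final int major;
    final int minor;
    final int bugfix;

    // Assumed "current" version, chosen arbitrarily for this sketch.
    static final SketchVersion LATEST = new SketchVersion(4, 10, 0);

    SketchVersion(int major, int minor, int bugfix) {
        this.major = major;
        this.minor = minor;
        this.bugfix = bugfix;
    }

    // Parses "major.minor" or "major.minor.bugfix"; a missing bugfix part means 0.
    static SketchVersion parse(String s) {
        String[] parts = s.trim().split("\\.");
        if (parts.length < 2 || parts.length > 3) {
            throw new IllegalArgumentException("Expected major.minor[.bugfix], got: " + s);
        }
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        int bugfix = parts.length == 3 ? Integer.parseInt(parts[2]) : 0;
        return new SketchVersion(major, minor, bugfix);
    }

    // Numeric comparison, so 4.10.0 sorts after 4.9.1 (a plain string compare would not).
    @Override
    public int compareTo(SketchVersion other) {
        if (major != other.major) return Integer.compare(major, other.major);
        if (minor != other.minor) return Integer.compare(minor, other.minor);
        return Integer.compare(bugfix, other.bugfix);
    }

    boolean onOrAfter(SketchVersion other) {
        return compareTo(other) >= 0;
    }

    @Override
    public String toString() {
        return major + "." + minor + "." + bugfix;
    }
}

// Hypothetical analyzer-like class: no Version in the constructor.
// It defaults to the latest behavior; production users may pin an older version.
class SketchAnalyzer {
    // Invented for illustration: pretend token lowercasing was introduced in 4.8.0.
    private static final SketchVersion CHANGED_IN = new SketchVersion(4, 8, 0);

    private SketchVersion version = SketchVersion.LATEST;

    void setVersion(SketchVersion version) {
        this.version = Objects.requireNonNull(version);
    }

    // All version-dependent branching lives inside the analyzer itself.
    String normalize(String token) {
        return version.onOrAfter(CHANGED_IN) ? token.toLowerCase(Locale.ROOT) : token;
    }
}
```

Someone prototyping would just write new SketchAnalyzer() and get current behavior, while someone with an existing index would call setVersion(SketchVersion.parse("4.7.2")) to keep the behavior their index was built with.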

Mike McCandless

http://blog.mikemccandless.com

On Fri, Aug 1, 2014 at 7:47 PM, Ryan Ernst <[email protected]> wrote:
> There has been a lot of heated discussion recently about version
> tracking in Lucene [1] [2]. I wanted to have a fresh discussion
> outside of jira to give a full description of the current state of
> things, the problems I have heard, and a proposed solution.
>
> CURRENT
>
> We have 2 pieces of code that handle “versioning.” The first is
> Constants.LUCENE_MAIN_VERSION, which is written to the SegmentsInfo
> for each segment. This is a string version which is used to detect
> when the current version of lucene is newer than the version that
> wrote the segment (and how/if an upgrade to a newer codec should be
> done). There is some complication with the “display” version and
> non-display version, which are distinguished by whether the version of
> lucene was an official release, or an alpha/beta version (which was
> added specifically for the 4.0.0 release ramp up). This string
> version also has its own parsing and comparison methods.
>
> The second piece of versioning code is in Version.java, which is an
> enum used by analyzers to maintain backwards compatible behavior given
> a specific version of lucene. The enum only contains values for dot
> releases of lucene, not bug fixes (which was what spurred the recent
> discussions over version). Analyzers’ constructors take a required
> Version parameter, which is only actually used by the few analyzers
> that have changed behavior recently. Version.java contains its own
> separate version parsing and comparison methods.
>
>
> CONCERNS
>
> * Having 2 different pieces of code that do very similar things is
> confusing for development. Very few developers appear to really
> understand the current system (especially when trying to understand
> the alpha/beta setup).
>
> * Users are generally confused by the Version passed to analyzers: I
> know I was when I first started working with Lucene, and
> Version.CURRENT_VERSION was deprecated because users used that without
> understanding the implications.
>
> * Most analyzers currently have dead code constructors, since they
> never make use of Version. There are also a lot of classes used by
> analyzers which contain similar dead code.
>
> * Backwards compatibility needs to be handled in some fashion, to
> ensure users have a path to upgrade from one version of lucene to
> another, without requiring immediate re-indexing.
>
>
> PROPOSAL
>
> I propose the following:
>
> * Consolidate all version related enumeration, including reading and
> writing string versions, into Version.java. Have a static method that
> returns the current lucene version (replacing
> Constants.LUCENE_MAIN_VERSION).
>
> * Make bug fix releases first class in the enumeration, so that they
> can be distinguished for any compatibility issues that come up.
>
> * Remove all snapshot/alpha/beta versioning logic. Alpha/beta was
> really only necessary for 4.0 because of the extreme changes that were
> being made. The system is much more stable now, and 5.0 should not
> require preview releases, IMO. I don’t think snapshots should be a
> concern because any user building an index from an unreleased build
> (which they built themselves) is just asking for trouble. They do so
> at their own risk (of figuring out how to upgrade their indexes if
> they are not trash-able). Backwards compatibility can be handled by
> adding the alpha/beta/final versions of 4.0 to the enum (and special
> parsing logic for this). If lucene changes so much that we need
> alpha/beta type discrimination in the future, we can revisit the
> system if simply having extra versions in the enum won't work.
>
> * Analyzers’ constructors should have Version removed, and a setter
> should be added which allows production users to set the version used.
> This way any analyzers can still use version if it is set to something
> other than current (which would be the default), but users simply
> prototyping do not need to worry about it.
>
> * Classes that analyzers use, which take Version, should have Version
> removed, and the analyzers should choose which settings/variants of
> those classes to use based on the version they have set. In other
> words, all version variant logic should be contained within the
> analyzers. For example, Lucene47WordDelimiterFilter, or
> StandardAnalyzer can take the unicode version.
> Factories could still take Version (e.g. TokenizerFactory,
> TokenFilterFactory, etc) to produce the correct component (so nothing
> will change for solr in this regard).
>
> I’m sure not everyone will be happy with what I have proposed, but I’m
> hoping we can work out a solution together, and then implement in a
> team-like fashion, the way I have seen the community work in the past,
> and I hope to see again in the future.
>
> Thanks
> Ryan
>
> [1] https://issues.apache.org/jira/browse/LUCENE-5850
> [2] https://issues.apache.org/jira/browse/LUCENE-5859
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
