Re: deprecating Versions

DM Smith Mon, 29 Nov 2010 06:05:39 -0800

On Nov 29, 2010, at 5:34 AM, Robert Muir wrote:

> On Mon, Nov 29, 2010 at 2:50 AM, Earwin Burrfoot <[email protected]> wrote:
>> And for indexes:
>> * Index compatibility is guaranteed across two adjacent major
>> releases. eg 2.x -> 3.x, 3.x -> 4.x.
>>  That includes both binary compat - codecs, and semantic compat -
>> analyzers (if appropriate Version is used).
>> * Older releases are most probably unsupported.
>>  e.g. 4.x still supports shared docstores for reading, though never
>> writes them. 5.x won't read them either, so you'll have to at least
>> fully optimize your 3.x indexes when going through 4.x to 5.x.
>> 
> 
> Is it somehow possible i could convince everyone that all the
> analyzers we provide are simply examples?


It really doesn't solve the problem. Analyzers are not much more than a 
tokenizer and zero or more filters chained in order. Right now, the "more" is 
the special code regarding reuse.
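
To make the "tokenizer plus ordered filters" picture concrete, here is a 
minimal sketch in plain Java. It uses no Lucene classes; ToyChain, ToyFilter, 
the whitespace tokenizer and the stop list are all hypothetical stand-ins 
invented for illustration. It only shows that the same filters in a different 
order yield different tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy model of an analysis chain; not Lucene's API.
final class ToyChain {

    // A filter maps one token list to another.
    interface ToyFilter {
        List<String> apply(List<String> tokens);
    }

    // Hypothetical tokenizer: splits on runs of whitespace.
    static List<String> whitespaceTokenize(String text) {
        List<String> out = new ArrayList<String>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Lowercases every token.
    static final ToyFilter LOWERCASE = new ToyFilter() {
        public List<String> apply(List<String> tokens) {
            List<String> out = new ArrayList<String>();
            for (String t : tokens) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
            return out;
        }
    };

    // Drops tokens that exactly match an entry in stopWords (case-sensitive).
    static ToyFilter stop(final List<String> stopWords) {
        return new ToyFilter() {
            public List<String> apply(List<String> tokens) {
                List<String> out = new ArrayList<String>();
                for (String t : tokens) {
                    if (!stopWords.contains(t)) {
                        out.add(t);
                    }
                }
                return out;
            }
        };
    }

    // Runs the tokenizer, then each filter in the order given.
    static List<String> analyze(String text, ToyFilter... filters) {
        List<String> tokens = whitespaceTokenize(text);
        for (ToyFilter f : filters) {
            tokens = f.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        ToyFilter stop = stop(Arrays.asList("the"));
        // Lowercase first, so "The" becomes "the" and is removed.
        System.out.println(analyze("The quick fox", LOWERCASE, stop)); // [quick, fox]
        // Stop filter first, so "The" survives and is lowercased afterwards.
        System.out.println(analyze("The quick fox", stop, LOWERCASE)); // [the, quick, fox]
    }
}
```

An index built with one ordering and queried with the other would disagree on 
whether "the" is a searchable term.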

In my project, I don't use any of the Analyzers that Lucene provides, but I 
have variants of them. (Mine take flags indicating whether to filter stop 
words and whether to do stemming.) The recent effort has been to change these 
analyzers to follow the new reuse pattern to improve performance.

Had there been a declarative mechanism, I wouldn't have needed to make the 
changes.

WRT an analyzer, if any of the following changes, all bets are off:
    Tokenizer (i.e. which tokenizer is used)
    The rules that a tokenizer uses to break text into tokens (e.g. query 
parser, break iterator, ...)
    The type associated with each token (e.g. word, number, url, ...)
    Presence/Absence of a particular filter
    Order of filters
    Tables that a filter uses
    Rules that a filter encodes
    The version and implementation of Unicode being used (whether via ICU, 
Lucene and/or Java)
    Bugs fixed in these components.
(This list is adapted from an email I wrote to a user's group explaining why 
texts need to be re-indexed.)
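
One way an application could detect such changes is to record a description 
of the chain next to the index and compare it on startup. The following is 
only a sketch under assumptions: AnalysisChainSignature, the component names 
and the version labels are all hypothetical, and nothing here is provided by 
Lucene. It also cannot see changes that are not reflected in a label, such as 
an unversioned bug fix.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; every name and version label below is made up.
final class AnalysisChainSignature {

    // Builds a canonical string from the Unicode version, the tokenizer and
    // the filters in the order they are applied.
    static String signature(String unicodeVersion, String tokenizer,
                            List<String> filtersInOrder) {
        StringBuilder sb = new StringBuilder();
        sb.append("unicode=").append(unicodeVersion);
        sb.append(";tokenizer=").append(tokenizer);
        for (String filter : filtersInOrder) {
            sb.append(";filter=").append(filter);
        }
        return sb.toString();
    }

    // Any difference at all is treated as a reason to rebuild the index.
    static boolean needsReindex(String storedSignature, String currentSignature) {
        return !storedSignature.equals(currentSignature);
    }

    public static void main(String[] args) {
        String stored = signature("5.2", "standard-v1",
                Arrays.asList("lowercase-v1", "stop-en-v2"));
        String reordered = signature("5.2", "standard-v1",
                Arrays.asList("stop-en-v2", "lowercase-v1"));
        System.out.println(needsReindex(stored, stored));    // false
        System.out.println(needsReindex(stored, reordered)); // true
    }
}
```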

Additionally, it is the user's responsibility to normalize the text, probably 
to NFC or NFKC, before indexing and searching. (Normalization may need to 
precede the Tokenizer if the Tokenizer is not Unicode aware. E.g. what does a 
LetterTokenizer do if the input is NFD and it encounters an accent?)
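
The NFD question can be tried with the JDK alone. This is a sketch, assuming 
a tokenizer that keeps runs of code points for which Character.isLetter is 
true and breaks on everything else; letterTokens below is a stand-in written 
for this example, not Lucene's LetterTokenizer, and NormalizationSketch is a 
made-up class name. java.text.Normalizer is the standard JDK API.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

final class NormalizationSketch {

    // Stand-in for a letter-based tokenizer (not Lucene's class): emits
    // maximal runs of code points accepted by Character.isLetter.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Composes base letters and combining marks where Unicode defines a
    // precomposed form.
    static String toNfc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "resume" with acute accents, written in decomposed (NFD) form:
        // each accent is the separate combining mark U+0301.
        String nfd = "re\u0301sume\u0301";
        // U+0301 is a non-spacing mark, not a letter, so the word is split.
        System.out.println(letterTokens(nfd));
        // After NFC the accented letters are single code points and the
        // word stays whole.
        System.out.println(letterTokens(toNfc(nfd)));
    }
}
```

With the stand-in tokenizer, the NFD input yields the two tokens "re" and 
"sume", while the NFC form yields one token. A query normalized differently 
from the indexed text would therefore miss.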

Recently, we've seen some mistrust here that JVMs at the same version level 
from different vendors (Sun, Harmony, IBM) produce the same results. (IIRC: 
Thai break iterator. Random tests.)

For the most part, searching the index will seem to be fine. It may only be 
edge cases that cause problems.

Adding documents to an index with a changed Analyzer might not be a good thing. 
It might lead to the question: "Why does my search find this Document, but 
not that Document? Both should be returned."

Within a release of Lucene, a small handful of analyzers may have changed 
sufficiently to warrant re-indexing the indexes built with them.

For me the bigger problem is that the parts of an analyzer are not separately 
versioned. It is not simply a matter of using a lucene-analyzers-XX.YY.jar. 
That is too coarse grained. Each release has new goodness regarding analysis of 
non-English texts and performance regarding all texts. If I want any or all of 
that, I have three choices:
a) Upgrade and rebuild every index. Since the desktop application does not know 
if a change requires rebuild, everything must be rebuilt.
or
b) Fork all the components I use. (To me this is just wrong, but perhaps 
necessary/expedient.)
or
c) Version the names of the packages and/or classes. (I don't like this idea 
either, but it works.)

Given that the releases of Lucene and my application are infrequent (so much 
for the release-often mantra), forcing a rebuild is not such a horrible thing 
for me.

So basically, I have given up on Lucene being backward compatible where it 
matters the most to me: stable analyzer components. What I gain by accepting 
this is worth far more. YMMV.

Hope this helps,
        DM

