On 11/29/2010 01:03 PM, Robert Muir wrote:
On Mon, Nov 29, 2010 at 12:51 PM, DM Smith<dmsmith...@gmail.com>  wrote:
I'd have to look to be sure: IIRC, Turkish was one. The treatment of 'i' was
buggy. Russian had its own encoding that was replaced with UTF-8. The
QueryParser had bug fixes. There is some effort to migrate away from the
older stemmers to Snowball, but at least the Dutch one is not "identical".

but none of these broke backwards compatibility, they all respect the
Version constant!
The SnowballAnalyzer respects the version constant for the buggy
Turkish lowercasing! If you use Version.LUCENE_30 (or less) it wrongly
lowercases so you get your old buggy behavior.

Even the old buggy Dutch stemmer is still there, and if you use
DutchAnalyzer(Version.LUCENE_30) (or less) it stems incorrectly so you
get your old buggy behavior!

The Russian analyzer was the same way, and so was the QueryParser.

So I'm sorry, I am left confused about where the backwards breaks are?

Strictly speaking there are none, in the present. The user of Lucene can choose to break compatibility and retain old (and in these cases, buggy) behavior. This maintains Lucene's bw-compat policy.

This thread talked about removing the Version constants in the future? I went back and re-read the thread. Perhaps I misunderstood. I saw several thoughts:
- Deprecate version constants 1 version back and remove those 2 versions back.
- Remove all version constants and use versioned jars instead.

If there is no way to select a prior behavior except to select a single jar that had lots of analyzers (or analyzer parts) in it, then I'm stuck with older code that is perhaps buggy. I can't pick a later analyzer for English and an earlier, buggy analyzer for Turkish. I have to get all of them from one jar. (Unless we get into renaming packages and/or classes). So I can't get some improvements while ignoring others.

I think there is a problem with deprecating and removing constants too. In trunk, which will be 4.0, it needs to be able to read and/or upgrade 2.x indexes. From an analyzer perspective, an index is invalid if the analyzer would produce a different token stream for the same input. If the 2.x version constants are gone, then the index built with 2.x version constants is no longer valid. (It might be valid, but how can one have any confidence of that?) Upgrading the index to the new internal format cannot change this. A buggy lowercase Turkish word will still be buggy after upgrade. (This is a 3.0 version constant that in 5.0 will still need to be around).
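
To make the token-stream point concrete, here is a minimal sketch in plain Java. It is not Lucene code: the MatchVersion enum and the analyze method are hypothetical stand-ins for a Version constant and an analyzer, and the two lowercasing modes only approximate the old and the fixed Turkish behavior. It shows that the same input yields different terms under the two behaviors, so a term indexed the old way would not be found by a query analyzed the new way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class VersionedTurkishTokens {
    // Hypothetical stand-in for Lucene's Version constants; not the real class.
    enum MatchVersion { V30, V31 }

    private static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Splits on whitespace and lowercases each token.
    // V30 models the old behavior: locale-insensitive lowercasing, 'I' -> 'i'.
    // V31 models the fixed behavior: Turkish rules, 'I' -> dotless i (U+0131).
    static List<String> analyze(String text, MatchVersion version) {
        Locale locale = (version == MatchVersion.V30) ? Locale.ROOT : TURKISH;
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(word.toLowerCase(locale));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Terms as an index built with the old behavior would have stored them.
        List<String> indexed = analyze("KIZ", MatchVersion.V30);
        // Terms as a query analyzed with the fixed behavior would produce them.
        List<String> queried = analyze("KIZ", MatchVersion.V31);
        // The two lists differ, so the query term would miss the indexed term.
        System.out.println("terms match: " + indexed.equals(queried));
    }
}
```

Keeping the old branch selectable is what lets an existing index stay consistent with its analyzer; once that branch is removed, the only safe option is to reindex.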

We either need more frequent releases (forcing the issue earlier and eliminating stale code earlier) or something's gotta give.

That said, as a user, I don't care any more. I'll give. The benefit of a better index outweighs backward compatibility for me.

-- DM

