On 11/29/2010 03:43 PM, Earwin Burrfoot wrote:
On Mon, Nov 29, 2010 at 20:51, DM Smith<dmsmith...@gmail.com>  wrote:
The other thing I'd like is for the spec to be saved alongside the index
as a manifest. From earlier threads, I can see that there might need to be
one for writing and another for reading. I'm not interested in using it to
construct an analyzer, but to determine whether the index is invalid with
respect to the analyzer currently in use.
You can already implement such behaviour with the 3.x branch of Lucene.
It has an IW.commit(Map<String, String> userdata) method that allows you
to commit with an arbitrary payload, which binds to the segment and can be
read back later.

Cool. I forgot entirely about that.
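
For what it's worth, here is a minimal sketch of the manifest check I have in mind. It uses only plain Java maps; the class name, the key names, and the analyzer class string are all hypothetical and are not part of Lucene. The assumption is that the "stored" map is whatever was handed to IW.commit(Map<String, String>) at index time and read back when the index is opened.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: class name and keys are hypothetical, not part of Lucene. */
final class AnalyzerManifest {

    /** Describe the analysis setup as plain strings so it can travel as commit user data. */
    static Map<String, String> describe(String analyzerClass, String matchVersion) {
        Map<String, String> manifest = new TreeMap<>();
        manifest.put("analyzer.class", analyzerClass);
        manifest.put("analyzer.matchVersion", matchVersion);
        return manifest;
    }

    /** Return the keys whose stored value differs from the analyzer now in use. */
    static List<String> mismatches(Map<String, String> stored, Map<String, String> current) {
        List<String> diffs = new ArrayList<>();
        for (Map.Entry<String, String> entry : current.entrySet()) {
            if (!entry.getValue().equals(stored.get(entry.getKey()))) {
                diffs.add(entry.getKey());
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        // Assumption: this map was passed to IW.commit(...) when the index was built.
        Map<String, String> stored = describe("com.example.MyTurkishAnalyzer", "LUCENE_30");
        // The analyzer configuration the application is using today.
        Map<String, String> current = describe("com.example.MyTurkishAnalyzer", "LUCENE_31");
        System.out.println("Mismatched keys: " + mismatches(stored, current));
    }
}
```

An empty result would mean the index can be trusted with the current analyzer; anything else means the application should warn or re-index.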

I think there is a problem with deprecating and removing constants too.
In trunk, which will be 4.0, it needs to be able to read and/or upgrade 2.x
indexes. From an analyzer perspective, an index is invalid if the analyzer
would produce a different token stream for the same input. If the 2.x
version constants are gone, then the index built with 2.x version
constants is no longer valid. (It might be valid, but how can one have any
confidence of that?) Upgrading the index to the new internal format
cannot change this. A buggy lowercase Turkish word will still be buggy
after upgrade. (This is a 3.0 version constant that in 5.0 will still need to 
be around).
I think it was declared that Lucene does not provide index
compatibility across more than a single major revision.
Thus, we don't guarantee reading 2.x index with 4.0 Lucene. So, we can
drop 2.x constants and compatibility.
But we still have to support 3.x. In version 5.0 then we're dropping
3.x constants and support for bugs/deprecated
features of 3.x.

Yes, you are correct that 4.0 may read 2.x indexes but is not guaranteed to. My bad, yet again. I went back to the threads regarding this around May 25, and it was also decided that 4.x might not be able to read 3.x, but will provide a migration tool in such a case.

That said, my point still stands. The 3.0 version constant, which is used by an analyzer to preserve 3.0 behavior, will need to be retained for the sake of analyzers in 5.0. Otherwise the index will need to be rebuilt from the original input. (I'm referencing 3.0 rather than 2.x because of the example I have in mind.)

A 3.0 index that is migrated to a 4.0 index still contains tokens produced by an analyzer that was buggy. For example, take a Turkish index with the wrong lower case i. (Prior to LUCENE-2101, an upper case I would lowercase to i. After: İ (dotted capital I) => i ("regular" lower case i) and I ("regular" upper case I) => ı (dotless lower case i).) This occurs very commonly in Turkish text. So the 4.0 index, still using the 3.0 version constant to get the expected behavior, works as it always did.
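
To make the difference concrete, here is a small sketch using only the JDK. The per-character loop stands in for a locale-blind lowercase step; I am not claiming it is Lucene's actual filter code, and the class and method names are made up for illustration.

```java
import java.util.Locale;

/** Sketch only: illustrates locale-blind vs. Turkish lowercasing with plain JDK calls. */
final class TurkishLowercaseDemo {

    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    /** Lowercase one char at a time with no locale, so 'I' always becomes 'i'. */
    static String lowerLocaleBlind(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            out.append(Character.toLowerCase(s.charAt(i)));
        }
        return out.toString();
    }

    /** Lowercase with Turkish rules: 'I' becomes dotless i, dotted capital I becomes 'i'. */
    static String lowerTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    /** Render a string as code points so the two kinds of i are easy to tell apart. */
    static String codePoints(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            out.append(String.format("U+%04X ", (int) s.charAt(i)));
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String word = "ISI"; // Turkish word written with plain upper case I
        System.out.println("locale-blind: " + codePoints(lowerLocaleBlind(word)));
        System.out.println("Turkish:      " + codePoints(lowerTurkish(word)));
    }
}
```

The two results differ in the first and last character (U+0069 versus U+0131), which is exactly the kind of token difference that survives an index format upgrade.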

Now in 5.0, there might be a migration tool, or it will be able to read a 4.x index. If the 3.0 constant is gone, none of these tokens are reachable: search requests will have the correct lower case i and will not be able to find the tokens with the wrong one. It will be very obvious.
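
A toy illustration of that failure, assuming the "index" is nothing more than a set of terms and the "analyzer" is just a lowercasing function. None of this is Lucene API; the names are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Sketch only: a term set stands in for an index, a function stands in for an analyzer. */
final class StaleTokenDemo {

    /** Old behavior: locale-blind lowercasing, so 'I' becomes 'i'. */
    static String oldLower(String term) {
        StringBuilder out = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); i++) {
            out.append(Character.toLowerCase(term.charAt(i)));
        }
        return out.toString();
    }

    /** New behavior: Turkish lowercasing, so 'I' becomes dotless i. */
    static String newLower(String term) {
        return term.toLowerCase(Locale.forLanguageTag("tr-TR"));
    }

    /** Build the term set the way the index-time analyzer would have. */
    static Set<String> index(UnaryOperator<String> analyzer, String... words) {
        Set<String> terms = new HashSet<>();
        for (String word : words) {
            terms.add(analyzer.apply(word));
        }
        return terms;
    }

    /** A query matches only if the query-time analyzer yields a term that was indexed. */
    static boolean hit(Set<String> terms, String query, UnaryOperator<String> analyzer) {
        return terms.contains(analyzer.apply(query));
    }

    public static void main(String[] args) {
        Set<String> terms = index(StaleTokenDemo::oldLower, "ISI");
        System.out.println("old analyzer finds it: " + hit(terms, "ISI", StaleTokenDemo::oldLower));
        System.out.println("new analyzer finds it: " + hit(terms, "ISI", StaleTokenDemo::newLower));
    }
}
```

Upgrading the container format changes nothing here: the stored terms are the same strings, so only re-indexing or keeping the old analysis behavior makes them reachable.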

Regarding this analyzer, code that uses a 2.x version constant for this analyzer will need to change to a 3.0 version constant in order for the index to be usable in the 4.x series if the 2.x constants are removed.

I don't think this is an isolated example.

With what's happening, every application that uses a deprecated version constant will have one very long major release cycle in which to rebuild its indexes from scratch.

And as I said at the bottom of my last email, I'm going to re-index because I am able and because I want correct behavior. So whatever is decided won't affect my application of Lucene.

-- DM





