On Jan 17, 2007, at 3:42 AM, Grant Ingersoll wrote:
> I think since we have already made some file format changes, we
> should consider some of the others on the table, namely
> https://issues.apache.org/jira/browse/LUCENE-510
> which concerns proper UTF-8 storage. The big issue with this one
> seems to be performance (and the patch needs to be updated) but it,
> as Marvin has stated, is what would allow us to do the Kino merge
> model, if desired, and would provide better compatibility w/ our
> sibling projects (an important consideration, but should not be the
> driver.)
I'm pleased to see you bring these to the fore, as they're issues I
care about and have spent considerable time on. However, I would not
hold up a 2.1 release for either proper UTF-8 or bytecount strings.
In addition to the performance considerations, the patch as it
currently stands completely destroys backwards compatibility -- only
indexes consisting of pure ASCII source material created pre-patch
are still able to be read post-patch.
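
To make the ASCII-only point concrete, here is a rough sketch, not code
from the patch. It assumes the pre-patch format prefixes each string with
its count of Java chars and the post-patch format prefixes it with its
count of UTF-8 bytes. The class and method names are made up for
illustration. The two prefixes agree only when every character is ASCII.

```java
import java.nio.charset.StandardCharsets;

public class StringPrefixDemo {
    // Length prefix under the assumed pre-patch scheme: number of Java chars
    // (UTF-16 code units).
    public static int charCount(String s) {
        return s.length();
    }

    // Length prefix under the assumed post-patch scheme: number of bytes in
    // the standard UTF-8 encoding.
    public static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when a reader expecting either prefix would see the same number.
    public static boolean prefixesAgree(String s) {
        return charCount(s) == utf8ByteCount(s);
    }

    public static void main(String[] args) {
        // Pure ASCII, Latin-1 accent, and a supplementary character (U+1D50A).
        String[] samples = {"lucene", "caf\u00e9", "\uD835\uDD0A"};
        for (String s : samples) {
            System.out.println("chars=" + charCount(s)
                    + " bytes=" + utf8ByteCount(s)
                    + " agree=" + prefixesAgree(s));
        }
    }
}
```

Only the first sample produces matching prefixes, so a post-patch reader
handed a pre-patch non-ASCII string would stop at the wrong offset.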
Switching to official UTF-8 on its own is not backwards compatible,
as discussed here: <http://xrl.us/uawo> (link to
mail-archives.apache.org)
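
For anyone who wants to see where the two encodings diverge, here is a
small sketch. It assumes the pre-patch encoding behaves like Java's
"modified UTF-8" as produced by DataOutputStream.writeUTF, which is an
assumption about the index format rather than something lifted from the
Lucene source. The class and method names are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8VariantsDemo {
    // Bytes of Java's modified UTF-8, obtained from writeUTF with its
    // two-byte length header stripped off.
    public static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Bytes of standard ("official") UTF-8.
    public static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static boolean sameBytes(String s) {
        return Arrays.equals(modifiedUtf8(s), standardUtf8(s));
    }

    public static void main(String[] args) {
        // BMP text, the null character, and a supplementary character.
        String[] samples = {"caf\u00e9", "\0", "\uD835\uDD0A"};
        for (String s : samples) {
            System.out.println("modified=" + modifiedUtf8(s).length
                    + " standard=" + standardUtf8(s).length
                    + " same=" + sameBytes(s));
        }
    }
}
```

Ordinary BMP text comes out byte-for-byte identical. The null character
and anything outside the BMP do not, which is enough to break old indexes
containing them.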
Switching to bytecount based strings is likewise a real headache for
backwards compat. We might have to do something like subclass
IndexInput and IndexOutput and choose a version based on segment
format. Even then, it's tricky because of how to deal with string
diffs.
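
Roughly what I have in mind for the subclassing idea, as a toy sketch
only. SketchInput, CharCountInput, ByteCountInput, open() and the two
format constants are all hypothetical stand-ins. None of them are real
Lucene classes or real segment format numbers, and the sketch does not
touch the string diff problem at all.

```java
import java.nio.charset.StandardCharsets;

public class SegmentStringSketch {
    // Hypothetical format ids, not taken from any real segment file.
    public static final int FORMAT_CHAR_COUNT = 0;
    public static final int FORMAT_BYTE_COUNT = 1;

    /** Minimal stand-in for an IndexInput-like reader over a byte array. */
    public abstract static class SketchInput {
        protected final byte[] data;
        protected int pos = 0;

        protected SketchInput(byte[] data) { this.data = data; }

        // Variable-length int: 7 bits per byte, high bit means "more follows".
        public int readVInt() {
            int b = data[pos++];
            int value = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = data[pos++];
                value |= (b & 0x7F) << shift;
            }
            return value;
        }

        public abstract String readString();
    }

    /** Old style: prefix is a char count, chars are 1 to 3 bytes each. */
    public static class CharCountInput extends SketchInput {
        public CharCountInput(byte[] data) { super(data); }

        @Override
        public String readString() {
            int numChars = readVInt();
            char[] chars = new char[numChars];
            for (int i = 0; i < numChars; i++) {
                int b = data[pos++] & 0xFF;
                if ((b & 0x80) == 0) {
                    chars[i] = (char) b;
                } else if ((b & 0xE0) == 0xC0) {
                    chars[i] = (char) (((b & 0x1F) << 6) | (data[pos++] & 0x3F));
                } else {
                    chars[i] = (char) (((b & 0x0F) << 12)
                            | ((data[pos++] & 0x3F) << 6)
                            | (data[pos++] & 0x3F));
                }
            }
            return new String(chars);
        }
    }

    /** New style: prefix is a byte count, payload is standard UTF-8. */
    public static class ByteCountInput extends SketchInput {
        public ByteCountInput(byte[] data) { super(data); }

        @Override
        public String readString() {
            int numBytes = readVInt();
            String s = new String(data, pos, numBytes, StandardCharsets.UTF_8);
            pos += numBytes;
            return s;
        }
    }

    // Pick the reader from the (hypothetical) segment format id.
    public static SketchInput open(int format, byte[] data) {
        switch (format) {
            case FORMAT_CHAR_COUNT: return new CharCountInput(data);
            case FORMAT_BYTE_COUNT: return new ByteCountInput(data);
            default: throw new IllegalArgumentException("unknown format " + format);
        }
    }

    public static void main(String[] args) {
        byte[] oldStyle = {4, 'c', 'a', 'f', (byte) 0xC3, (byte) 0xA9};
        byte[] newStyle = {5, 'c', 'a', 'f', (byte) 0xC3, (byte) 0xA9};
        System.out.println(open(FORMAT_CHAR_COUNT, oldStyle).readString());
        System.out.println(open(FORMAT_BYTE_COUNT, newStyle).readString());
    }
}
```

The payload bytes for this sample are the same in both layouts and only
the prefix differs, so the reader has to know the segment's format before
it reads a single string.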
I promise to update the bytecounts/utf8 patch after KS 0.20_01 is
done, but I can't get to it before that. There's a lot of pressure
on me to get a new version of KS out the door.
Your more general point about batching up file format changes
reflects what I've always thought, but I wonder... Doug has laid out
a backwards compatibility policy about always reading stuff written
one major version back. It occurs to me that the more frequently
major versions get released, the more quickly we can dispense with
crufty compatibility code. :)
Also, I'm curious as to how many people use NFS in live systems.
KS has the same problems Lucene does, and it's a common enough
complaint that I've added an FAQ item. It's an important issue.
However, I don't have the faintest idea how to solve it.
So unless someone comes up with something simple and brilliant, I
don't think it should stand in the way, either.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/