On Jan 17, 2007, at 3:42 AM, Grant Ingersoll wrote:

I think since we have already made some file format changes, we should consider some of the others on the table, namely https://issues.apache.org/jira/browse/LUCENE-510 which concerns proper UTF-8 storage. The big issue with this one seems to be performance (and the patch needs to be updated) but it, as Marvin has stated, is what would allow us to do the Kino merge model, if desired, and would provide better compatibility w/ our sibling projects (an important consideration, but should not be the driver.)

I'm pleased to see you bring these to the fore, as they're issues I care about and have spent considerable time on. However, I would not hold up a 2.1 release for either proper UTF-8 or bytecount strings.

In addition to the performance considerations, the patch as it currently stands completely destroys backwards compatibility -- only indexes consisting of pure ASCII source material created pre-patch are still able to be read post-patch.

Switching to official UTF-8 on its own is not backwards compatible, as discussed here: <http://xrl.us/uawo> (Link to mail-archives.apache.org)
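
To make the encoding incompatibility concrete, here is a minimal sketch in Python. It assumes the existing index format uses Java's "modified UTF-8" (U+0000 written as two bytes, supplementary characters written as two 3-byte surrogate halves); the function names are hypothetical and are not part of Lucene or of the patch.

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text as Java-style modified UTF-8 (assumed legacy encoding)."""
    out = bytearray()
    # Walk UTF-16 code units, since that is how Java strings are stored.
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        unit = (utf16[i] << 8) | utf16[i + 1]
        if unit == 0:
            # U+0000 becomes the overlong two-byte form C0 80.
            out += b"\xc0\x80"
        elif unit < 0x80:
            out.append(unit)
        elif unit < 0x800:
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            # Surrogate halves land here too, so a supplementary
            # character takes 3 + 3 = 6 bytes instead of 4.
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)


def same_bytes(text: str) -> bool:
    """True when the assumed legacy encoding matches standard UTF-8."""
    return encode_modified_utf8(text) == text.encode("utf-8")
```

Under these assumptions, ASCII and other Basic Multilingual Plane text encodes identically both ways, while U+0000 and any character above U+FFFF does not, so a reader expecting one form cannot safely consume the other.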

Switching to bytecount based strings is likewise a real headache for backwards compat. We might have to do something like subclass IndexInput and IndexOutput and choose a version based on segment format. Even then, it's tricky because of how to deal with string diffs.
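
A rough sketch of the length-prefix problem, and of choosing a reader by segment format, might look like the following. Everything here is an illustrative assumption: a single-byte prefix stands in for the real variable-length integer, the text is limited to short Basic Multilingual Plane strings, and the names (`read_string`, the `"legacy"` and `"bytecount"` labels) are hypothetical rather than taken from Lucene, KS, or the patch. String diffs are not modeled.

```python
def write_string_charcount(text: str) -> bytes:
    """Legacy-style record: the prefix counts characters, not bytes."""
    return bytes([len(text)]) + text.encode("utf-8")


def write_string_bytecount(text: str) -> bytes:
    """Proposed-style record: the prefix counts encoded bytes."""
    data = text.encode("utf-8")
    return bytes([len(data)]) + data


def read_string_charcount(buf: bytes) -> str:
    """Decode one character at a time until the character count is met."""
    count = buf[0]
    pos = 1
    chars = []
    for _ in range(count):
        first = buf[pos]
        # The lead byte tells us how many bytes this character occupies.
        if first < 0x80:
            width = 1
        elif first < 0xE0:
            width = 2
        elif first < 0xF0:
            width = 3
        else:
            width = 4
        chars.append(buf[pos:pos + width].decode("utf-8"))
        pos += width
    return "".join(chars)


def read_string_bytecount(buf: bytes) -> str:
    """Slice exactly the prefixed number of bytes and decode them at once."""
    count = buf[0]
    return buf[1:1 + count].decode("utf-8")


# Hypothetical dispatch table keyed on a per-segment format label.
READERS = {
    "legacy": read_string_charcount,
    "bytecount": read_string_bytecount,
}


def read_string(buf: bytes, segment_format: str) -> str:
    """Pick the reader that matches the segment's recorded format."""
    return READERS[segment_format](buf)
```

With these definitions, a legacy record for an ASCII string reads back correctly through either reader, because character count and byte count coincide. A legacy record for "naïve" read as a bytecount string comes back truncated, which is the same failure mode as the pure-ASCII limitation described above.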

I promise to update the bytecounts/utf8 patch after KS 0.20_01 is done, but I can't get to it before that. There's a lot of pressure on me to get a new version of KS out the door.

Your more general point about batching up file format changes reflects what I've always thought, but I wonder... Doug has laid out a backwards compatibility policy about always reading stuff written one major version back. It occurs to me that the more frequently major versions get released, the more quickly we can dispense with crufty compatibility code. :)

Also, I'm curious as to how many people use NFS in live systems.

KS has the same problems Lucene does, and it's a common enough complaint that I've added an FAQ item. It's an important issue.

However, I don't have the faintest idea how to solve it.

So unless someone comes up with something simple and brilliant, I don't think it should stand in the way, either.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


