Marvin Humphrey wrote:
More problematic than the "Modified UTF-8" actually, is the definition of a Lucene String. According to the File Formats document, "Lucene writes strings as a VInt representing the length, followed by the character data." The word "length" is ambiguous in that context, and at first I took it to mean either length in Unicode code points or bytes. It was a nasty shock to discover that it was actually Java chars. Bizarre and painful contortions were suddenly required for encoding/decoding a term dictionary which would otherwise have been completely unnecessary.
Yes, this should be corrected. The problem is that "length" refers to the length of the Java string, but that is not explicit. Moreover, as you have pointed out, that is a bad choice for non-Java implementations.
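To make the ambiguity concrete, here is a small sketch of the three things "length" could mean for one string. The class and method names are made up for illustration and are not part of Lucene; only the standard JDK calls are real.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Lucene code: three candidate meanings of "length".
public class StringLengths {
    // Length in Java chars (UTF-16 code units), i.e. String.length().
    static int javaChars(String s) {
        return s.length();
    }

    // Length in Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Length in bytes when encoded as standard UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a', e-acute (U+00E9), and U+1D11E, which needs a surrogate pair in Java.
        String s = "a\u00E9\uD834\uDD1E";
        System.out.println("Java chars:  " + javaChars(s));
        System.out.println("code points: " + codePoints(s));
        System.out.println("UTF-8 bytes: " + utf8Bytes(s));
    }
}
```

The three counts agree for plain ASCII and diverge as soon as a string contains anything outside it, which is why a reader in another language cannot use the prefix as a byte count or a code point count.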
Ease of interchange and ease of implementation do not seem to have been primary design considerations -- which is perfectly reasonable, if true, but perhaps then it should not aspire to serve as a vehicle for interchange.
The index format document was written years after Lucene was written, after Lucene had already been ported to other languages. It seemed like a good idea to document what folks were porting. Ease of interchange and implementation were not primary considerations when Lucene was developed. That said, at the time Lucene was first written (1997), Unicode was only 16-bit and there was no discrepancy between Java's modified encoding and UTF-8.
At this point I think the suggestion of turning the File Formats document from an ostensible spec into a piece of ordinary documentation is a worthy one. FWIW, I've pretty much given up on the idea of making KinoSearch and Lucene file-format-compatible. In my weaker moments I imagine that I might sell the Lucene community on the changes that would be necessary.
Please do. But suggestions without working patches are not always acted on. Most of us are busy with other projects, and only advance Lucene when we have a need, or someone provides a patch. Ideally we need to find someone who *needs* an index format that's easily interchangeable between Java and other languages to push this forward.
Then I remember that many of you live in a world where "Modified UTF-8" isn't an abomination. ;)
Modified UTF-8 is not anyone's choice. It's simply what's used by Java. What are we supposed to do, picket Sun? If we move to make Lucene's file format an interchange format, then we must clearly move beyond it.
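For anyone who wants to see where the two encodings part ways, here is a rough sketch that compares the JDK's modified UTF-8, as produced by DataOutputStream.writeUTF, with standard UTF-8. The class and method names are hypothetical, and I am assuming writeUTF is a fair stand-in for the encoding under discussion; Lucene's own writer may not match it byte for byte.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not Lucene code.
public class ModifiedUtf8Demo {
    // Bytes produced by DataOutputStream.writeUTF, minus its 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Bytes produced by the standard UTF-8 encoder.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String bmpOnly = "a\u00E9";        // BMP, no NUL: both encodings agree
        String nul = "\u0000";             // NUL: 2 bytes modified, 1 byte standard
        String gClef = "\uD834\uDD1E";     // U+1D11E: 6 bytes modified, 4 bytes standard
        for (String s : new String[] {bmpOnly, nul, gClef}) {
            System.out.println(modifiedUtf8(s).length + " vs " + standardUtf8(s).length);
        }
    }
}
```

The differences are confined to NUL and to characters outside the 16-bit range, so a decoder in another language that expects standard UTF-8 only breaks on those inputs.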
Doug