On May 1, 2006, at 7:33 PM, Chuck Williams wrote:

> Could someone summarize succinctly why it is considered a
> major issue that Lucene uses the Java modified UTF-8
> encoding within its index rather than the standard UTF-8
> encoding. Is the only concern compatibility with index
> formats in other Lucene variants?
I originally raised a stink about "Modified UTF-8" because at the time I was embroiled in an effort to implement the Lucene file format, and the Lucene File Formats document claimed to be using "UTF-8", straight up. It was most unpleasant to discover that if my app read legal UTF-8, Lucene-generated indexes would cause it to crash from time to time, and that if it wrote legal UTF-8, the indexes it generated would cause Lucene to crash from time to time.
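For anyone who hasn't been bitten by this yet, here's a minimal sketch of where the two encodings diverge. It uses plain java.io rather than Lucene's own IndexOutput, but the byte-level rules are the same "Modified UTF-8": U+0000 becomes the two-byte sequence C0 80, and supplementary characters are written as two 3-byte surrogate encodings instead of one 4-byte sequence. A strict UTF-8 decoder rejects both forms, and a strict UTF-8 encoder produces bytes a Modified-UTF-8 reader doesn't expect.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws Exception {
            // A string containing a NUL and one supplementary character,
            // U+1D11E (MUSICAL SYMBOL G CLEF).
            String s = "a\u0000" + new String(Character.toChars(0x1D11E));

            // Standard UTF-8: 'a' = 1 byte, NUL = 1 byte, U+1D11E = 4 bytes.
            byte[] standard = s.getBytes("UTF-8");

            // Java's "Modified UTF-8", as emitted by writeUTF: NUL = 2 bytes
            // (C0 80), U+1D11E = two 3-byte surrogate encodings = 6 bytes.
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            byte[] modified = bos.toByteArray();

            System.out.println("standard UTF-8 bytes: " + standard.length);       // 6
            System.out.println("modified UTF-8 bytes: " + (modified.length - 2)); // 9
            // (writeUTF prepends its own 2-byte length, subtracted above.)
        }
    }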
More problematic than the "Modified UTF-8", actually, is the definition of a Lucene String. According to the File Formats document, "Lucene writes strings as a VInt representing the length, followed by the character data." The word "length" is ambiguous in that context, and at first I took it to mean either the number of Unicode code points or the number of bytes. It was a nasty shock to discover that it actually means Java chars, i.e. UTF-16 code units. Bizarre and painful contortions which would otherwise have been completely unnecessary were suddenly required for encoding/decoding a term dictionary.
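To make the ambiguity concrete, here's a tiny illustration (the term string is hypothetical, not taken from any real index) of the three things "length" could plausibly mean for a string containing one supplementary character:

    public class StringLengthDemo {
        public static void main(String[] args) throws Exception {
            // "e" followed by U+1D11E, which Java stores as a surrogate pair.
            String term = "e" + new String(Character.toChars(0x1D11E));

            int javaChars  = term.length();                         // 3 -- UTF-16 code units
            int codePoints = term.codePointCount(0, term.length()); // 2 -- Unicode code points
            int utf8Bytes  = term.getBytes("UTF-8").length;         // 5 -- standard UTF-8 bytes

            System.out.println(javaChars + " / " + codePoints + " / " + utf8Bytes);
            // The VInt prefix in a Lucene String counts the first of these;
            // a reader that assumes either of the other two will mis-parse
            // the character data that follows.
        }
    }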
I used to think that the Lucene file format might serve as "the TIFF of inverted indexes". My perspective on this has changed. Lucene's file format is just beastly difficult to implement from scratch, and anything short of full implementation guarantees occasional "Read past EOF" errors on interchange. Personally, I would assess the file format as the secondary expression of a beautiful algorithmic design. Ease of interchange and ease of implementation do not seem to have been primary design considerations -- which is perfectly reasonable, if true, but perhaps then it should not aspire to serve as a vehicle for interchange. As was asserted in the recent thread on ACID compliance, the indexes produced by a full-text indexer are not meant to serve as primary document storage. It's common to need to move a TIFF or a text file from system to system. It's not common to need to move a derived index.
Compatibility has its advantages. It was pretty nice to be able to browse KinoSearch-generated indexes using Luke, once I managed to achieve compatibility for all-ASCII source material. But holy crow, was it tough to debug those indexes. No human-readable components. No fixed block sizes. No facilities for resyncing a stream once it gets off track. All that on top of the "Modified UTF-8" and the String definition.
At this point I think the suggestion of turning the File Formats document from an ostensible spec into a piece of ordinary documentation is a worthy one. FWIW, I've pretty much given up on the idea of making KinoSearch and Lucene file-format-compatible. In my weaker moments I imagine that I might sell the Lucene community on the changes that would be necessary. Then I remember that many of you live in a world where "Modified UTF-8" isn't an abomination. ;)
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/