On May 1, 2006, at 7:33 PM, Chuck Williams wrote:

> Could someone summarize succinctly why it is considered a
> major issue that Lucene uses the Java modified UTF-8
> encoding within its index rather than the standard UTF-8
> encoding. Is the only concern compatibility with index
> formats in other Lucene variants?
I originally raised a stink about "Modified UTF-8" because at the time I was embroiled in an effort to implement the Lucene file format, and the Lucene File Formats document claimed to be using "UTF-8", straight up. It was most unpleasant to discover that if my app read legal UTF-8, Lucene-generated indexes would cause it to crash from time to time, and that if it wrote legal UTF-8, the indexes it generated would cause Lucene to crash from time to time.
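For anyone who hasn't been bitten by this yet, here's a minimal sketch of where the two encodings diverge. It uses plain java.io rather than Lucene's own IndexOutput, but the byte-level rules are the same "Modified UTF-8": U+0000 becomes the two-byte sequence C0 80, and supplementary characters are written as two 3-byte surrogate encodings instead of one 4-byte sequence. A strict UTF-8 decoder rejects both forms, and a strict UTF-8 encoder produces bytes a Modified-UTF-8 reader doesn't expect.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws Exception {
            // A string containing a NUL and one supplementary character,
            // U+1D11E (MUSICAL SYMBOL G CLEF).
            String s = "a\u0000" + new String(Character.toChars(0x1D11E));

            // Standard UTF-8: 'a' = 1 byte, NUL = 1 byte, U+1D11E = 4 bytes.
            byte[] standard = s.getBytes("UTF-8");

            // Java's "Modified UTF-8", as emitted by writeUTF: NUL = 2 bytes
            // (C0 80), U+1D11E = two 3-byte surrogate encodings = 6 bytes.
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            byte[] modified = bos.toByteArray();

            System.out.println("standard UTF-8 bytes: " + standard.length);       // 6
            System.out.println("modified UTF-8 bytes: " + (modified.length - 2)); // 9
            // (writeUTF prepends its own 2-byte length, subtracted above.)
        }
    }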
More problematic than the "Modified UTF-8", actually, is the definition of a Lucene String. According to the File Formats document, "Lucene writes strings as a VInt representing the length, followed by the character data." The word "length" is ambiguous in that context, and at first I took it to mean either the number of Unicode code points or the number of bytes. It was a nasty shock to discover that it actually means Java chars, i.e. UTF-16 code units. Bizarre and painful contortions which would otherwise have been completely unnecessary were suddenly required for encoding/decoding a term dictionary.
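To make the ambiguity concrete, here's a tiny illustration (the term string is hypothetical, not taken from any real index) of the three things "length" could plausibly mean for a string containing one supplementary character:

    public class StringLengthDemo {
        public static void main(String[] args) throws Exception {
            // "e" followed by U+1D11E, which Java stores as a surrogate pair.
            String term = "e" + new String(Character.toChars(0x1D11E));

            int javaChars  = term.length();                         // 3 -- UTF-16 code units
            int codePoints = term.codePointCount(0, term.length()); // 2 -- Unicode code points
            int utf8Bytes  = term.getBytes("UTF-8").length;         // 5 -- standard UTF-8 bytes

            System.out.println(javaChars + " / " + codePoints + " / " + utf8Bytes);
            // The VInt prefix in a Lucene String counts the first of these;
            // a reader that assumes either of the other two will mis-parse
            // the character data that follows.
        }
    }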
I used to think that the Lucene file format might serve as "the TIFF of inverted indexes". My perspective on this has changed. Lucene's file format is just beastly difficult to implement from scratch, and anything short of full implementation guarantees occasional "Read past EOF" errors on interchange. Personally, I would assess the file format as the secondary expression of a beautiful algorithmic design. Ease of interchange and ease of implementation do not seem to have been primary design considerations -- which is perfectly reasonable, if true, but perhaps then it should not aspire to serve as a vehicle for interchange. As was asserted in the recent thread on ACID compliance, the indexes produced by a full-text indexer are not meant to serve as primary document storage. It's common to need to move a TIFF or a text file from system to system. It's not common to need to move a derived index.
Compatibility has its advantages. It was pretty nice to be able to browse KinoSearch-generated indexes using Luke, once I managed to achieve compatibility for all-ASCII source material. But holy crow, was it tough to debug those indexes. No human-readable components. No fixed block sizes. No facilities for resyncing a stream once it gets off track. All that on top of the "Modified UTF-8" and the String definition.
At this point I think the suggestion of turning the File Formats document from an ostensible spec into a piece of ordinary documentation is a worthy one. FWIW, I've pretty much given up on the idea of making KinoSearch and Lucene file-format-compatible. In my weaker moments I imagine that I might sell the Lucene community on the changes that would be necessary. Then I remember that many of you live in a world where "Modified UTF-8" isn't an abomination. ;)
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/