Re: storing term text internally as byte array and bytecount as prefix, etc.

Doug Cutting Fri, 05 May 2006 08:15:32 -0700

Marvin Humphrey wrote:

More problematic than the "Modified UTF-8" actually, is the definition of a Lucene String. According to the File Formats document, "Lucene writes strings as a VInt representing the length, followed by the character data." The word "length" is ambiguous in that context, and at first I took it to mean either length in Unicode code points or bytes. It was a nasty shock to discover that it was actually Java chars. Bizarre and painful contortions were suddenly required for encoding/decoding a term dictionary which would otherwise have been completely unnecessary.

Yes, this should be corrected. The problem is that "length" refers to the length of the Java string, but that is not explicit. Moreover, as you have pointed out, that is a bad choice for non-Java implementations.
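To make the ambiguity concrete, here is a minimal sketch of the three candidate meanings of "length" for one string. The class and method names are made up for illustration and are not part of Lucene; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

public class StringLengths {
    // Number of Java chars (UTF-16 code units); this is what String.length() returns.
    static int javaChars(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in the standard UTF-8 encoding.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a' (U+0061), e-acute (U+00E9), musical G clef (U+1D11E, a surrogate pair in Java)
        String s = "a\u00E9\uD834\uDD1E";
        System.out.println("Java chars:  " + javaChars(s));   // 4
        System.out.println("code points: " + codePoints(s));  // 3
        System.out.println("UTF-8 bytes: " + utf8Bytes(s));   // 7
    }
}
```

A reader that assumes the prefix counts bytes or code points will misread any string where these three numbers differ, which is the case for every string containing non-ASCII text.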

Ease of interchange and ease of implementation do not seem to have been primary design considerations -- which is perfectly reasonable, if true, but perhaps then it should not aspire to serve as a vehicle for interchange.

The index format document was written years after Lucene was written, after Lucene had already been ported to other languages. It seemed like a good idea to document what folks were porting. Ease of interchange and implementation were not primary considerations when Lucene was developed. That said, at the time Lucene was first written (1997), Unicode was only 16-bit and there was no discrepancy between Java's modified encoding and UTF-8.
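As a rough illustration of where the two encodings agree and where they diverge, the sketch below compares the JDK's DataOutputStream.writeUTF (which emits Java's modified UTF-8) with standard UTF-8. It exercises the JDK only, not Lucene's own writer, and the class and method names are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    // Byte length of Java's modified UTF-8 encoding of s,
    // excluding the 2-byte length header that writeUTF prepends.
    static int modifiedUtf8Length(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size() - 2;
    }

    // Byte length of the standard UTF-8 encoding of s.
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String bmp = "\u00E9\u4E2D";            // two 16-bit characters, both in the BMP
        String supplementary = "\uD834\uDD1E";  // U+1D11E, outside the BMP
        String nul = "\u0000";                  // the one BMP character that differs

        // Same length for ordinary 16-bit characters: 5 and 5
        System.out.println(modifiedUtf8Length(bmp) + " vs " + standardUtf8Length(bmp));
        // Each surrogate is encoded separately in modified UTF-8: 6 vs 4
        System.out.println(modifiedUtf8Length(supplementary) + " vs " + standardUtf8Length(supplementary));
        // U+0000 takes two bytes in modified UTF-8: 2 vs 1
        System.out.println(modifiedUtf8Length(nul) + " vs " + standardUtf8Length(nul));
    }
}
```

Apart from U+0000, the two encodings only part ways once characters beyond 16 bits enter the picture.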

At this point I think the suggestion of turning the File Formats document from an ostensible spec into a piece of ordinary documentation is a worthy one. FWIW, I've pretty much given up on the idea of making KinoSearch and Lucene file-format-compatible. In my weaker moments I imagine that I might sell the Lucene community on the changes that would be necessary.

Please do. But suggestions without working patches are not always acted on. Most of us are busy with other projects, and only advance Lucene when we have a need, or someone provides a patch. Ideally we need to find someone who *needs* an index format that's easily interchangeable between Java and other languages to push this forward.

Then I remember that many of you live in a world where "Modified UTF-8" isn't an abomination. ;)

Modified UTF-8 is not anyone's choice. It's simply what's used by Java. What are we supposed to do, picket Sun? If we move to make Lucene's file format an interchange format, then we must clearly move beyond it.


Doug
