Yonik Seeley wrote:
I've been looking around... do you have a pointer to the source where just the suffix is converted from UTF-8?

I understand the index format, but I'm not sure I understand the problem that would be posed by the prefix length being a byte count.

TermBuffer.java:66

Things could work fine if the prefix length were a byte count. A byte buffer could easily be constructed that contains the full byte sequence (prefix + suffix), and then this could be converted to a String. The inefficiency would come if the prefix were re-converted from UTF-8 for each term, e.g., in order to compare it to the target. Prefixes are frequently longer than suffixes, so this could be significant. Does that make sense? I don't know whether it would actually be significant, although TermBuffer.java was added recently as a measurable performance enhancement, so this is performance-critical code.
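To illustrate the trade-off, here is a rough sketch (the names are illustrative, not Lucene's actual API). With a char-count prefix, the previously decoded characters can be reused and only the suffix bytes pass through UTF-8 decoding; with a byte-count prefix, the full byte sequence has to be reassembled and the prefix re-decoded on every term:

```java
import java.nio.charset.StandardCharsets;

public class PrefixDecode {
    // Char-count prefix: reuse the previous term's decoded chars and
    // only run UTF-8 decoding over the suffix bytes.
    static String readTerm(char[] prev, int prefixChars, byte[] suffixBytes) {
        String suffix = new String(suffixBytes, StandardCharsets.UTF_8);
        return new String(prev, 0, prefixChars) + suffix;
    }

    // Byte-count prefix: the whole term's bytes must be reassembled and
    // re-decoded each time, including the (often long) prefix.
    static String readTermByteCount(byte[] prevBytes, int prefixBytes, byte[] suffixBytes) {
        byte[] full = new byte[prefixBytes + suffixBytes.length];
        System.arraycopy(prevBytes, 0, full, 0, prefixBytes);
        System.arraycopy(suffixBytes, 0, full, prefixBytes, suffixBytes.length);
        return new String(full, StandardCharsets.UTF_8); // re-decodes the prefix too
    }

    public static void main(String[] args) {
        char[] prev = "apple".toCharArray();
        byte[] suffix = "y".getBytes(StandardCharsets.UTF_8);
        System.out.println(readTerm(prev, 4, suffix)); // prints "apply"
    }
}
```

Both paths produce the same term; the difference is purely how much UTF-8 decoding work is repeated per term.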

We need to stop discussing this in the abstract and start coding alternatives and benchmarking them. Is java.nio.charset.CharsetEncoder fast enough? Will moving things through CharBuffer and ByteBuffer be too slow? Should Lucene keep maintaining its own UTF-8 implementation for performance? I don't know, only some experiments will tell.
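For what it's worth, the java.nio route under discussion would look roughly like this: wrap the term's chars in a CharBuffer and drive a reusable CharsetEncoder into a ByteBuffer. This is only the shape a benchmark would time against a hand-rolled UTF-8 loop, not a claim about which is faster:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class NioEncodeSketch {
    // Encode one term to UTF-8 via a (reusable) CharsetEncoder.
    // A real benchmark would also reuse the buffers across terms.
    static byte[] encode(CharsetEncoder encoder, String term) {
        encoder.reset();
        // 3 bytes per char is the UTF-8 worst case when encoding
        // from Java chars (surrogate pairs yield 4 bytes per 2 chars).
        ByteBuffer bytes = ByteBuffer.allocate(term.length() * 3);
        encoder.encode(CharBuffer.wrap(term), bytes, true);
        encoder.flush(bytes);
        bytes.flip();
        byte[] out = new byte[bytes.remaining()];
        bytes.get(out);
        return out;
    }

    public static void main(String[] args) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        byte[] b = encode(enc, "caf\u00e9");
        System.out.println(b.length); // "café" is 5 bytes in UTF-8
    }
}
```

Timing this loop against Lucene's own UTF-8 code over a realistic term dictionary would answer the CharsetEncoder question directly.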

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
