Yonik Seeley wrote:
I've been looking around... do you have a pointer to the source where just the suffix is converted from UTF-8?

I understand the index format, but I'm not sure I understand the problem that would be posed by the prefix length being a byte count.

TermBuffer.java:66

Things could work fine if the prefix length were a byte count. A byte buffer could easily be constructed that contains the full byte sequence (prefix + suffix), and then this could be converted to a String. The inefficiency would come if the prefix were re-converted from UTF-8 for each term, e.g., in order to compare it to the target. Prefixes are frequently longer than suffixes, so this could be significant. Does that make sense? I don't know whether it would actually be significant, although TermBuffer.java was added recently as a measurable performance enhancement, so this is performance-critical code.
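To illustrate the trade-off, here is a rough sketch (the names are illustrative, not Lucene's actual API). With a char-count prefix, the previously decoded characters can be reused and only the suffix bytes pass through UTF-8 decoding; with a byte-count prefix, the full byte sequence has to be reassembled and the prefix re-decoded on every term:

```java
import java.nio.charset.StandardCharsets;

public class PrefixDecode {
    // Char-count prefix: reuse the previous term's decoded chars and
    // only run UTF-8 decoding over the suffix bytes.
    static String readTerm(char[] prev, int prefixChars, byte[] suffixBytes) {
        String suffix = new String(suffixBytes, StandardCharsets.UTF_8);
        return new String(prev, 0, prefixChars) + suffix;
    }

    // Byte-count prefix: the whole term's bytes must be reassembled and
    // re-decoded each time, including the (often long) prefix.
    static String readTermByteCount(byte[] prevBytes, int prefixBytes, byte[] suffixBytes) {
        byte[] full = new byte[prefixBytes + suffixBytes.length];
        System.arraycopy(prevBytes, 0, full, 0, prefixBytes);
        System.arraycopy(suffixBytes, 0, full, prefixBytes, suffixBytes.length);
        return new String(full, StandardCharsets.UTF_8); // re-decodes the prefix too
    }

    public static void main(String[] args) {
        char[] prev = "apple".toCharArray();
        byte[] suffix = "y".getBytes(StandardCharsets.UTF_8);
        System.out.println(readTerm(prev, 4, suffix)); // prints "apply"
    }
}
```

Both paths produce the same term; the difference is purely how much UTF-8 decoding work is repeated per term.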

We need to stop discussing this in the abstract and start coding alternatives and benchmarking them. Is java.nio.charset.CharsetEncoder fast enough? Will moving things through CharBuffer and ByteBuffer be too slow? Should Lucene keep maintaining its own UTF-8 implementation for performance? I don't know, only some experiments will tell.
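For what it's worth, the java.nio route under discussion would look roughly like this: wrap the term's chars in a CharBuffer and drive a reusable CharsetEncoder into a ByteBuffer. This is only the shape a benchmark would time against a hand-rolled UTF-8 loop, not a claim about which is faster:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class NioEncodeSketch {
    // Encode one term to UTF-8 via a (reusable) CharsetEncoder.
    // A real benchmark would also reuse the buffers across terms.
    static byte[] encode(CharsetEncoder encoder, String term) {
        encoder.reset();
        // 3 bytes per char is the UTF-8 worst case when encoding
        // from Java chars (surrogate pairs yield 4 bytes per 2 chars).
        ByteBuffer bytes = ByteBuffer.allocate(term.length() * 3);
        encoder.encode(CharBuffer.wrap(term), bytes, true);
        encoder.flush(bytes);
        bytes.flip();
        byte[] out = new byte[bytes.remaining()];
        bytes.get(out);
        return out;
    }

    public static void main(String[] args) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        byte[] b = encode(enc, "caf\u00e9");
        System.out.println(b.length); // "café" is 5 bytes in UTF-8
    }
}
```

Timing this loop against Lucene's own UTF-8 code over a realistic term dictionary would answer the CharsetEncoder question directly.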

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
