I wrote:

> I think I'll take a crack at a custom charsToUTF8 converter algo.
Still no luck. Still 20% slower than the current implementation.
The algo is below, for reference.
It's entirely possible that my patches are doing something dumb
that's causing this, given my limited experience with Java. But if
that's not the case, I can think of two other explanations.
One is that the passage of the text through an intermediate buffer
before blasting it out is considerably more expensive than anticipated.
The other is that the pre-allocation of a char[] array based on the
length VInt yields a significant benefit over the standard techniques
for reading in UTF-8. That wouldn't be hard to believe. Without
that number, there's a lot of guesswork involved. English requires
about 1.1 bytes per UTF-8 code point; Japanese, 3. Multiple memory
allocation ops may be required as bytes get read in, especially if
the final String object kicked out HAS to use the bare minimum amount
of memory. I don't suppose there's any way for me to snoop just
what's happening under the hood in these CharsetDecoder classes or
String constructors, is there?
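For reference, the stock decode path has roughly this shape (a sketch of
the general shape, not the actual library internals; the method name is
mine):

  import java.nio.ByteBuffer;
  import java.nio.CharBuffer;
  import java.nio.charset.CharacterCodingException;
  import java.nio.charset.Charset;
  import java.nio.charset.CharsetDecoder;

  // Sketch of the standard path.  The decoder sizes its output from
  // averageCharsPerByte() -- guesswork, absent a known char count --
  // and may have to grow and copy it; toString() then copies once more
  // into the final String.
  static String decodeViaCharsetDecoder(byte[] bytes, int offset, int len)
      throws CharacterCodingException {
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
    CharBuffer chars = decoder.decode(ByteBuffer.wrap(bytes, offset, len));
    return chars.toString();
  }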
Scanning through a SegmentTermEnum with next() doesn't seem to be any
slower with a byte-based TermBuffer, and my index-1000-wikipedia-docs
benchmarker doesn't slow down that much when IndexInput is changed to
use a String constructor that accepts UTF-8 bytes rather than chars.
However, it's possible that the modified toTerm method of TermBuffer
is a bottleneck, as it also uses the UTF-8 String constructor. It
doesn't get exercised under SegmentTermEnum.next(), but during
merging of segments I believe it sees plenty of action -- maybe a lot
more than IndexInput's readString.
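For concreteness, the constructor call in question is just the plain
charset-name variant, along these lines (names are placeholders, not the
actual patch):

  import java.io.UnsupportedEncodingException;

  // Placeholder shape of the modified toTerm's decode step -- not the
  // actual patch.  Presumably this routes through the same decoder
  // machinery as readString, intermediate char[] and all.
  static String termText(byte[] termBytes, int termLength)
      throws UnsupportedEncodingException {
    return new String(termBytes, 0, termLength, "UTF-8");
  }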
So my next step is to write a utf8ToString method that's as efficient
as I can make it. After that... I dunno, I'm running out of ideas.
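Roughly what I have in mind, as a first cut: since a UTF-8 sequence never
decodes to more chars than it has bytes, one upper-bound char[] allocation
is enough, with a single trimming copy in the String constructor at the
end.

  // First-cut sketch, assuming well-formed UTF-8.  char[byteLen] is a
  // safe upper bound on the char count, so a single allocation
  // suffices; the String constructor performs one trimming copy.
  public static String utf8ToString(byte[] bytes, int offset, int byteLen) {
    final char[] chars = new char[byteLen];
    final int end = offset + byteLen;
    int i = offset;
    int j = 0;
    while (i < end) {
      final int b = bytes[i++] & 0xFF;
      if (b < 0x80) {                // one byte: ASCII
        chars[j++] = (char)b;
      } else if (b < 0xE0) {         // two-byte sequence
        chars[j++] = (char)(((b & 0x1F) << 6) | (bytes[i++] & 0x3F));
      } else if (b < 0xF0) {         // three-byte sequence
        chars[j++] = (char)(((b & 0x0F) << 12)
            | ((bytes[i++] & 0x3F) << 6)
            | (bytes[i++] & 0x3F));
      } else {                       // four-byte sequence: surrogate pair
        int utf32 = ((b & 0x07) << 18)
            | ((bytes[i++] & 0x3F) << 12)
            | ((bytes[i++] & 0x3F) << 6)
            | (bytes[i++] & 0x3F);
        utf32 -= 0x10000;
        chars[j++] = (char)(0xD800 | (utf32 >>> 10));
        chars[j++] = (char)(0xDC00 | (utf32 & 0x3FF));
      }
    }
    return new String(chars, 0, j);
  }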
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
import java.nio.ByteBuffer;

/** Encode a slice of s as UTF-8 into byteBuf, which must be
 *  array-backed (non-direct).  Returns the buffer the caller should
 *  keep: on overflow a larger buffer is allocated and returned in
 *  place of the one passed in. */
public static final ByteBuffer stringToUTF8(
    String s, int start, int length, ByteBuffer byteBuf) {
  byteBuf.clear();
  int i = start;
  int j = 0;
  try {
    final int end = start + length;
    final byte[] bytes = byteBuf.array();
    for ( ; i < end; i++) {
      final int code = (int)s.charAt(i);
      if (code < 0x80) {
        bytes[j++] = (byte)code;
      } else if (code < 0x800) {
        bytes[j++] = (byte)(0xC0 | (code >> 6));
        bytes[j++] = (byte)(0x80 | (code & 0x3F));
      } else if (code < 0xD800 || code > 0xDFFF) {
        bytes[j++] = (byte)(0xE0 | (code >>> 12));
        bytes[j++] = (byte)(0x80 | ((code >> 6) & 0x3F));
        bytes[j++] = (byte)(0x80 | (code & 0x3F));
      } else {
        // surrogate pair
        int utf32;
        // confirm valid high surrogate
        if (code < 0xDC00 && (i < end - 1)) {
          utf32 = (int)s.charAt(i + 1);
          // confirm valid low surrogate and write pair
          if (utf32 >= 0xDC00 && utf32 <= 0xDFFF) {
            // (code - 0xD7C0) << 10 folds in the 0x10000 offset
            utf32 = ((code - 0xD7C0) << 10) + (utf32 & 0x3FF);
            i++;
            bytes[j++] = (byte)(0xF0 | (utf32 >>> 18));
            bytes[j++] = (byte)(0x80 | ((utf32 >> 12) & 0x3F));
            bytes[j++] = (byte)(0x80 | ((utf32 >> 6) & 0x3F));
            bytes[j++] = (byte)(0x80 | (utf32 & 0x3F));
            continue;
          }
        }
        // replace unpaired high surrogate or out-of-order low surrogate
        // with the replacement character U+FFFD
        bytes[j++] = (byte)0xEF;
        bytes[j++] = (byte)0xBF;
        bytes[j++] = (byte)0xBD;
      }
    }
  } catch (ArrayIndexOutOfBoundsException e) {
    // Ran off the end of the buffer: guess how many more bytes it will
    // take, plus 10%, and retry with a bigger one.  Guard against
    // charsProcessed == 0, which would make the estimate NaN or
    // Infinity and the allocation size garbage.
    final int charsProcessed = i - start;
    final float bytesPerChar = charsProcessed == 0
        ? 3.0f  // worst case: 3 bytes per UTF-16 char
        : (j / (float)charsProcessed) * 1.1f;
    final float charsLeft = length - charsProcessed;
    final float targetSize
        = (float)byteBuf.capacity() + bytesPerChar * charsLeft + 1.0f;
    return stringToUTF8(s, start, length,
                        ByteBuffer.allocate((int)targetSize));
  }
  byteBuf.position(j);
  return byteBuf;
}
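One note on the contract: callers have to keep the returned buffer, since
the method allocates and returns a bigger one on overflow. A hypothetical
call site, inside something like IndexOutput (not from the patch):

  // Hypothetical call site: one scratch buffer reused across calls, so
  // the steady state never allocates.  Reassigning the reference
  // matters because stringToUTF8 hands back a replacement buffer on
  // overflow.
  private ByteBuffer scratch = ByteBuffer.allocate(128);

  public void writeString(String s) throws IOException {
    scratch = stringToUTF8(s, 0, s.length(), scratch);
    writeVInt(scratch.position());                  // byte count
    writeBytes(scratch.array(), scratch.position());
  }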