All of the JDK source is available via download from Sun.

-----Original Message-----
From: Marvin Humphrey [mailto:[EMAIL PROTECTED]
Sent: Monday, October 31, 2005 6:31 PM
To: java-dev@lucene.apache.org
Subject: Re: bytecount as String and prefix length

I wrote...

> I think I'll take a crack at a custom charsToUTF8 converter algo.

Still no luck.  Still 20% slower than the current implementation.  The algo is below, for reference.

It's entirely possible that my patches are doing something dumb that's causing this, given my limited experience with Java.  But if that's not the case, I can think of two other explanations.

One is that the passage of the text through an intermediate buffer before blasting it out is considerably more expensive than anticipated.

The other is that the pre-allocation of a char[] array based on the length VInt yields a significant benefit over the standard techniques for reading in UTF-8.  That wouldn't be hard to believe.  Without that number, there's a lot of guesswork involved.  English requires about 1.1 bytes per UTF-8 code point; Japanese, 3.  Multiple memory allocation ops may be required as bytes get read in, especially if the final String object kicked out HAS to use the bare minimum amount of memory.  I don't suppose there's any way for me to snoop just what's happening under the hood in these CharsetDecoder classes or String constructors, is there?

Scanning through a SegmentTermEnum with next() doesn't seem to be any slower with a byte-based TermBuffer, and my index-1000-wikipedia-docs benchmarker doesn't slow down that much when IndexInput is changed to use a String constructor that accepts UTF-8 bytes rather than chars.  However, it's possible that the modified toTerm method of TermBuffer is a bottleneck, as it also uses the UTF-8 String constructor.  It doesn't get exercised under SegmentTermEnum.next(), but during merging of segments I believe it sees plenty of action -- maybe a lot more than IndexInput's readString.

So my next step is to write a utf8ToString method that's as efficient as I can make it.  After that... I dunno, I'm running out of ideas.

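For illustration only, here is a minimal sketch of the kind of utf8ToString method described above.  It is not the code from any patch.  The class name Utf8Sketch and the charCount parameter are hypothetical; the sketch assumes the caller already knows the number of UTF-16 chars (as it would from the length VInt) and that the input is well-formed UTF-8, so it does no validation.

```java
// Hypothetical helper, not part of Lucene or of the patches discussed here.
final class Utf8Sketch {

    // Decodes byteLen bytes of well-formed UTF-8 starting at offset.
    // charCount is the number of UTF-16 chars the caller expects
    // (assumed known in advance, e.g. from a stored length).
    static String utf8ToString(byte[] bytes, int offset, int byteLen,
                               int charCount) {
        final char[] chars = new char[charCount]; // one up-front allocation
        final int end = offset + byteLen;
        int i = offset;
        int j = 0;
        while (i < end) {
            final int b = bytes[i++] & 0xFF;
            if (b < 0x80) {
                // one byte: ASCII
                chars[j++] = (char) b;
            } else if (b < 0xE0) {
                // two bytes: U+0080..U+07FF
                chars[j++] = (char) (((b & 0x1F) << 6)
                        | (bytes[i++] & 0x3F));
            } else if (b < 0xF0) {
                // three bytes: rest of the BMP
                chars[j++] = (char) (((b & 0x0F) << 12)
                        | ((bytes[i++] & 0x3F) << 6)
                        | (bytes[i++] & 0x3F));
            } else {
                // four bytes: supplementary code point, emitted as a
                // surrogate pair
                final int cp = ((b & 0x07) << 18)
                        | ((bytes[i++] & 0x3F) << 12)
                        | ((bytes[i++] & 0x3F) << 6)
                        | (bytes[i++] & 0x3F);
                chars[j++] = (char) (0xD7C0 + (cp >> 10));
                chars[j++] = (char) (0xDC00 + (cp & 0x3FF));
            }
        }
        // The String constructor copies the array, so one copy remains
        // even with the pre-sized buffer.
        return new String(chars, 0, j);
    }
}
```

A call would look like Utf8Sketch.utf8ToString(buf, 0, byteLen, charCount).  Whether this beats the String constructor that accepts UTF-8 bytes would have to be measured; the only point is that knowing the char count lets the decoder allocate exactly once.
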
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

public static final ByteBuffer stringToUTF8(
    String s, int start, int length, ByteBuffer byteBuf) {
  byteBuf.clear();
  int i = start;
  int j = 0;
  try {
    final int end = start + length;
    byte[] bytes = byteBuf.array();
    for ( ; i < end; i++) {
      final int code = (int)s.charAt(i);
      if (code < 0x80)
        bytes[j++] = (byte)code;
      else if (code < 0x800) {
        bytes[j++] = (byte)(0xC0 | (code >> 6));
        bytes[j++] = (byte)(0x80 | (code & 0x3F));
      }
      else if (code < 0xD800 || code > 0xDFFF) {
        bytes[j++] = (byte)(0xE0 | (code >>> 12));
        bytes[j++] = (byte)(0x80 | ((code >> 6) & 0x3F));
        bytes[j++] = (byte)(0x80 | (code & 0x3F));
      }
      else {
        // surrogate pair
        int utf32;
        // confirm valid high surrogate
        if (code < 0xDC00 && (i < end-1)) {
          utf32 = ((int)s.charAt(i+1));
          // confirm valid low surrogate and write pair
          if (utf32 >= 0xDC00 && utf32 <= 0xDFFF) {
            utf32 = ((code - 0xD7C0) << 10) + (utf32 & 0x3FF);
            i++;
            bytes[j++] = (byte)(0xF0 | (utf32 >>> 18));
            bytes[j++] = (byte)(0x80 | ((utf32 >> 12) & 0x3F));
            bytes[j++] = (byte)(0x80 | ((utf32 >> 6) & 0x3F));
            bytes[j++] = (byte)(0x80 | (utf32 & 0x3F));
            continue;
          }
        }
        // replace unpaired surrogate or out-of-order low surrogate
        // with substitution character
        bytes[j++] = (byte)0xEF;
        bytes[j++] = (byte)0xBF;
        bytes[j++] = (byte)0xBD;
      }
    }
  }
  catch (ArrayIndexOutOfBoundsException e) {
    // guess how many more bytes it will take, plus 10%
    // (the divisor is clamped to 1 so that overflowing on the very
    // first char cannot divide by zero and yield a bogus target size)
    float charsProcessed = Math.max(1.0f, (float)(i - start));
    float bytesPerChar = (j / charsProcessed) * 1.1f;
    float charsLeft = length - charsProcessed;
    float targetSize = (float)byteBuf.capacity()
      + bytesPerChar * charsLeft + 1.0f;
    return stringToUTF8(s, start, length,
      ByteBuffer.allocate((int) targetSize));
  }
  byteBuf.position(j);
  return byteBuf;
}

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]