I wrote:

> I think I'll take a crack at a custom charsToUTF8 converter algo.
Still no luck. Still 20% slower than the current implementation.
The algo is below, for reference.
It's entirely possible that my patches are doing something dumb
that's causing this, given my limited experience with Java. But if
that's not the case, I can think of two other explanations.
One is that the passage of the text through an intermediate buffer
before blasting it out is considerably more expensive than anticipated.
The other is that the pre-allocation of a char[] array based on the
length VInt yields a significant benefit over the standard techniques
for reading in UTF-8. That wouldn't be hard to believe. Without
that number, there's a lot of guesswork involved. English requires
about 1.1 bytes per UTF-8 code point; Japanese, 3. Multiple memory
allocation ops may be required as bytes get read in, especially if
the final String object kicked out HAS to use the bare minimum amount
of memory. I don't suppose there's any way for me to snoop just
what's happening under the hood in these CharsetDecoder classes or
String constructors, is there?
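For reference, the stock decode path has roughly this shape (a sketch of
the general shape, not the actual library internals; the method name is
mine):

  import java.nio.ByteBuffer;
  import java.nio.CharBuffer;
  import java.nio.charset.CharacterCodingException;
  import java.nio.charset.Charset;
  import java.nio.charset.CharsetDecoder;

  // Sketch of the standard path.  The decoder sizes its output from
  // averageCharsPerByte() -- guesswork, absent a known char count --
  // and may have to grow and copy it; toString() then copies once more
  // into the final String.
  static String decodeViaCharsetDecoder(byte[] bytes, int offset, int len)
      throws CharacterCodingException {
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
    CharBuffer chars = decoder.decode(ByteBuffer.wrap(bytes, offset, len));
    return chars.toString();
  }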
Scanning through a SegmentTermEnum with next() doesn't seem to be any
slower with a byte-based TermBuffer, and my index-1000-wikipedia-docs
benchmarker doesn't slow down that much when IndexInput is changed to
use a String constructor that accepts UTF-8 bytes rather than chars.
However, it's possible that the modified toTerm method of TermBuffer
is a bottleneck, as it also uses the UTF-8 String constructor. It
doesn't get exercised under SegmentTermEnum.next(), but during
merging of segments I believe it sees plenty of action -- maybe a lot
more than IndexInput's readString.
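For concreteness, the constructor call in question is just the plain
charset-name variant, along these lines (names are placeholders, not the
actual patch):

  import java.io.UnsupportedEncodingException;

  // Placeholder shape of the modified toTerm's decode step -- not the
  // actual patch.  Presumably this routes through the same decoder
  // machinery as readString, intermediate char[] and all.
  static String termText(byte[] termBytes, int termLength)
      throws UnsupportedEncodingException {
    return new String(termBytes, 0, termLength, "UTF-8");
  }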
So my next step is to write a utf8ToString method that's as efficient
as I can make it. After that... I dunno, I'm running out of ideas.
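Roughly what I have in mind, as a first cut: since a UTF-8 sequence never
decodes to more chars than it has bytes, one upper-bound char[] allocation
is enough, with a single trimming copy in the String constructor at the
end.

  // First-cut sketch, assuming well-formed UTF-8.  char[byteLen] is a
  // safe upper bound on the char count, so a single allocation
  // suffices; the String constructor performs one trimming copy.
  public static String utf8ToString(byte[] bytes, int offset, int byteLen) {
    final char[] chars = new char[byteLen];
    final int end = offset + byteLen;
    int i = offset;
    int j = 0;
    while (i < end) {
      final int b = bytes[i++] & 0xFF;
      if (b < 0x80) {                // one byte: ASCII
        chars[j++] = (char)b;
      } else if (b < 0xE0) {         // two-byte sequence
        chars[j++] = (char)(((b & 0x1F) << 6) | (bytes[i++] & 0x3F));
      } else if (b < 0xF0) {         // three-byte sequence
        chars[j++] = (char)(((b & 0x0F) << 12)
            | ((bytes[i++] & 0x3F) << 6)
            | (bytes[i++] & 0x3F));
      } else {                       // four-byte sequence: surrogate pair
        int utf32 = ((b & 0x07) << 18)
            | ((bytes[i++] & 0x3F) << 12)
            | ((bytes[i++] & 0x3F) << 6)
            | (bytes[i++] & 0x3F);
        utf32 -= 0x10000;
        chars[j++] = (char)(0xD800 | (utf32 >>> 10));
        chars[j++] = (char)(0xDC00 | (utf32 & 0x3FF));
      }
    }
    return new String(chars, 0, j);
  }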
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
import java.nio.ByteBuffer;

/** Encode a slice of s as UTF-8 into byteBuf, which must be
 *  array-backed (non-direct).  Returns the buffer the caller should
 *  keep: on overflow a larger buffer is allocated and returned in
 *  place of the one passed in. */
public static final ByteBuffer stringToUTF8(
    String s, int start, int length, ByteBuffer byteBuf) {
  byteBuf.clear();
  int i = start;
  int j = 0;
  try {
    final int end = start + length;
    final byte[] bytes = byteBuf.array();
    for ( ; i < end; i++) {
      final int code = (int)s.charAt(i);
      if (code < 0x80) {
        bytes[j++] = (byte)code;
      } else if (code < 0x800) {
        bytes[j++] = (byte)(0xC0 | (code >> 6));
        bytes[j++] = (byte)(0x80 | (code & 0x3F));
      } else if (code < 0xD800 || code > 0xDFFF) {
        bytes[j++] = (byte)(0xE0 | (code >>> 12));
        bytes[j++] = (byte)(0x80 | ((code >> 6) & 0x3F));
        bytes[j++] = (byte)(0x80 | (code & 0x3F));
      } else {
        // surrogate pair
        int utf32;
        // confirm valid high surrogate
        if (code < 0xDC00 && (i < end - 1)) {
          utf32 = (int)s.charAt(i + 1);
          // confirm valid low surrogate and write pair
          if (utf32 >= 0xDC00 && utf32 <= 0xDFFF) {
            // (code - 0xD7C0) << 10 folds in the 0x10000 offset
            utf32 = ((code - 0xD7C0) << 10) + (utf32 & 0x3FF);
            i++;
            bytes[j++] = (byte)(0xF0 | (utf32 >>> 18));
            bytes[j++] = (byte)(0x80 | ((utf32 >> 12) & 0x3F));
            bytes[j++] = (byte)(0x80 | ((utf32 >> 6) & 0x3F));
            bytes[j++] = (byte)(0x80 | (utf32 & 0x3F));
            continue;
          }
        }
        // replace unpaired high surrogate or out-of-order low surrogate
        // with the replacement character U+FFFD
        bytes[j++] = (byte)0xEF;
        bytes[j++] = (byte)0xBF;
        bytes[j++] = (byte)0xBD;
      }
    }
  } catch (ArrayIndexOutOfBoundsException e) {
    // Ran off the end of the buffer: guess how many more bytes it will
    // take, plus 10%, and retry with a bigger one.  Guard against
    // charsProcessed == 0, which would make the estimate NaN or
    // Infinity and the allocation size garbage.
    final int charsProcessed = i - start;
    final float bytesPerChar = charsProcessed == 0
        ? 3.0f  // worst case: 3 bytes per UTF-16 char
        : (j / (float)charsProcessed) * 1.1f;
    final float charsLeft = length - charsProcessed;
    final float targetSize
        = (float)byteBuf.capacity() + bytesPerChar * charsLeft + 1.0f;
    return stringToUTF8(s, start, length,
                        ByteBuffer.allocate((int)targetSize));
  }
  byteBuf.position(j);
  return byteBuf;
}
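One note on the contract: callers have to keep the returned buffer, since
the method allocates and returns a bigger one on overflow. A hypothetical
call site, inside something like IndexOutput (not from the patch):

  // Hypothetical call site: one scratch buffer reused across calls, so
  // the steady state never allocates.  Reassigning the reference
  // matters because stringToUTF8 hands back a replacement buffer on
  // overflow.
  private ByteBuffer scratch = ByteBuffer.allocate(128);

  public void writeString(String s) throws IOException {
    scratch = stringToUTF8(s, 0, s.length(), scratch);
    writeVInt(scratch.position());                  // byte count
    writeBytes(scratch.array(), scratch.position());
  }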