http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

Go to 3.9 Unicode Encoding Forms. Or simply search D93

On 10/1/2011 2:21 PM, Ulf Zibis wrote:
Am 30.09.2011 22:46, schrieb Xueming Shen:
On 09/30/2011 07:09 AM, Ulf Zibis wrote:

(1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} ---> CoderResult.malformedForLength(1)

It appears the Unicode Standard now explicitly recommends returning malformed length 2, which is what the UTF-8 decoder is doing now for this scenario.
My idea behind this was that, in the case of malformed length 1, a consecutive call to the decode loop would again return another malformed length 1, ensuring 2 replacement chars in the output string. (Not sure if that is expected in this corner case.)

Unicode Standard's "best practices" D93a/b recommends to return 2 in this case.
Can you please give me a link for D93a/b? I don't know where to find it.
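The case under discussion can be checked directly against a CharsetDecoder in REPORT mode. A minimal sketch (class and method names are illustrative); on a current JDK, which follows the D93a/b "best practice", the maximal valid subpart E1 80 is reported as a single malformed sequence of length 2:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

public class MalformedLengthDemo {
    // Decode the given bytes as UTF-8 with REPORT error action and
    // return the first CoderResult.
    static CoderResult decodeResult(byte[] bytes) {
        CharsetDecoder dec = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return dec.decode(ByteBuffer.wrap(bytes), CharBuffer.allocate(16), true);
    }

    public static void main(String[] args) {
        // 0xE1 0x80 is a valid 3-byte prefix; 0x42 ('B') is not a
        // continuation byte, so the maximal valid subpart has length 2.
        CoderResult cr = decodeResult(new byte[] {(byte) 0xE1, (byte) 0x80, (byte) 0x42});
        System.out.println(cr.isMalformed() + " " + cr.length());
    }
}
```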




3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results <http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537>


I'm not sure I understand the suggested b1 < -0x3e patch; I don't see how we can simply replace
((b1 >> 5) == -2) with (b1 < -0x3e).
You have to read the b1 < -0x3e in combination with the following b1 < -0x20. ;-)

But now I have a better "if...else if" switch. :-)
- saves the shift operations
- only 1 comparison per case
- only 1 constant to load per case
- helps compiler to benefit from 1 byte constants and op-codes
- much more readable
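In Java signed-byte terms, such a cascade might look like the following sketch (the method name is illustrative; the constants follow from the two's-complement values of the lead-byte ranges, e.g. 0xC2..0xDF is -0x3E..-0x21):

```java
public class LeadByteSwitch {
    // Classify a UTF-8 lead byte by sequence length; 0 = malformed lead.
    static int sequenceLength(byte b1) {
        if (b1 >= 0)         return 1; // 0x00..0x7F: ASCII
        else if (b1 < -0x3e) return 0; // 0x80..0xC1: continuation byte or overlong C0/C1 (cf. 6795537)
        else if (b1 < -0x20) return 2; // 0xC2..0xDF: 2-byte lead
        else if (b1 < -0x10) return 3; // 0xE0..0xEF: 3-byte lead
        else if (b1 < -0x08) return 4; // 0xF0..0xF7: 4-byte lead (F5..F7 still need a later range check)
        else                 return 0; // 0xF8..0xFF: invalid
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0x41)); // ASCII
        System.out.println(sequenceLength((byte) 0xC1)); // overlong lead, rejected
        System.out.println(sequenceLength((byte) 0xE1)); // 3-byte lead
        System.out.println(sequenceLength((byte) 0xF0)); // 4-byte lead
    }
}
```

Note that each branch needs only one comparison against a constant that fits in a signed byte, which is the point of the bullet list above.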

I believe we changed from (b1 < xyz) to (b1 >> x) == -2 back in 2009(?) because
the benchmark showed the "shift" version to be slightly faster.
IIRC this was only about shifting by multiples of 8 to allow a 1-byte comparison of 16/32-bit values in the double/quad-byte charsets.


Do you have any numbers that show a difference now? My non-scientific benchmark still
suggests the "shift" version is faster on the -server VM, with no significant difference on the -client VM.

  ------------------ your new switch ---------------
(1) -server
Method                      Millis  Ratio
Decoding 1b UTF-8 :            125  1.000
Decoding 2b UTF-8 :           2558 20.443
Decoding 3b UTF-8 :           3439 27.481
Decoding 4b UTF-8 :           2030 16.221
(2) -client
Decoding 1b UTF-8 :            335  1.000
Decoding 2b UTF-8 :           1041  3.105
Decoding 3b UTF-8 :           2245  6.694
Decoding 4b UTF-8 :           1254  3.741

  ------------------ existing "shift" ---------------
(1) -server
Decoding 1b UTF-8 :            134  1.000
Decoding 2b UTF-8 :           1891 14.106
Decoding 3b UTF-8 :           2934 21.886
Decoding 4b UTF-8 :           2133 15.913
(2) -client
Decoding 1b UTF-8 :            341  1.000
Decoding 2b UTF-8 :            949  2.560
Decoding 3b UTF-8 :           2321  6.255
Decoding 4b UTF-8 :           1278  3.446

Very interesting and surprising numbers!
The most surprising is that the client compiler generates faster code for the 2..4-byte codes. I think we should ask the HotSpot team for help; as UTF-8 de/encoding is a very frequent task, HotSpot should provide compiled code that is as well optimized as possible for it.

Another surprise is that the 1b UTF-8 benchmark is not the same for the "new switch" and "shift" versions, as the ASCII-only loop is identical in both. To solve the mystery of why the "shift" version is a little faster than the "new switch" version, it would be helpful to see a disassembly of the HotSpot-compiled code.

A third version, using the "(b1 & 0xe0) == 0xc0" / "(b1 & 0xf0) == 0xe0" / "(b1 & 0xf8) == 0xf0" pattern, would be interesting for the benchmark comparison too.

In my opinion it would be more significant to compare x code points of each 1..4-byte class than y bytes of each, i.e. 1000 bytes of 1-byte codes against 2000 bytes of 2-byte codes against 3000 bytes of 3-byte codes against 4000 bytes of 4-byte codes (1000 code points in each case).
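Such test data is easy to generate, keeping the code-point count fixed per class so the per-code-point cost is directly comparable (a sketch; the sample characters and helper name are arbitrary):

```java
import java.nio.charset.StandardCharsets;

public class EqualCodePointData {
    // Hypothetical helper: n copies of one code point, encoded as UTF-8.
    static byte[] repeat(int codePoint, int n) {
        String one = new String(Character.toChars(codePoint));
        StringBuilder sb = new StringBuilder(n * 2);
        for (int i = 0; i < n; i++) sb.append(one);
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int n = 1000;
        byte[] b1 = repeat('A', n);     // 1-byte codes ->  n bytes
        byte[] b2 = repeat(0xE4, n);    // 2-byte codes -> 2n bytes (U+00E4)
        byte[] b3 = repeat(0x20AC, n);  // 3-byte codes -> 3n bytes (U+20AC)
        byte[] b4 = repeat(0x1F600, n); // 4-byte codes -> 4n bytes (U+1F600)
        System.out.println(b1.length + " " + b2.length + " " + b3.length + " " + b4.length);
        // prints "1000 2000 3000 4000"
    }
}
```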

We should document somewhere that the ESU-8 decoder is faster than the strict UTF-8 decoder, for developers who can ensure that there are no invalid surrogates in their source bytes.

-Ulf
