http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
Go to 3.9 Unicode Encoding Forms. Or simply search D93
On 10/1/2011 2:21 PM, Ulf Zibis wrote:
Am 30.09.2011 22:46, schrieb Xueming Shen:
On 09/30/2011 07:09 AM, Ulf Zibis wrote:
(1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} --->
CoderResult.malformedForLength(1)
It appears the Unicode Standard now explicitly recommends returning
a malformed length of 2 in this scenario, which is what UTF-8 does
now.
My idea behind it was that, in the case of malformed length 1, a
subsequent call to the decode loop would again return another
malformed length 1, to ensure 2 replacement chars in the output
string. (Not sure if that is expected in this corner case.)
The Unicode Standard's "best practices" D93a/b recommend returning 2
in this case.
Can you please give me a link for D93a/b? I don't know where to find it.
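As an aside, the reporting behavior under discussion can be observed directly through the public CharsetDecoder API. Here is a minimal standalone sketch (class and helper names are made up, not JDK-internal code); a JDK that follows the D93a/b maximal-subpart recommendation reports a malformed length of 2 for this input:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class MalformedLengthDemo {
    // Decode the given bytes as UTF-8 and return the first CoderResult.
    // A fresh CharsetDecoder defaults to CodingErrorAction.REPORT, so
    // malformed input is not replaced but reported back to the caller.
    static CoderResult firstResult(byte[] bytes) {
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(bytes.length);
        return StandardCharsets.UTF_8.newDecoder().decode(in, out, true);
    }

    public static void main(String[] args) {
        // 0xE1 0x80 is a truncated 3-byte sequence; 0x42 ('B') is not a
        // continuation byte.  Per D93a/b ("maximal subpart of an
        // ill-formed subsequence") the decoder should report one malformed
        // sequence of length 2, leaving 0x42 to be decoded normally.
        CoderResult cr = firstResult(
                new byte[] { (byte) 0xE1, (byte) 0x80, (byte) 0x42 });
        System.out.println(cr); // MALFORMED[2] on a JDK following D93a/b
    }
}
```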
3. Consider additionally 6795537 - UTF_8$Decoder returns wrong
results <http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537>
I'm not sure I understand the suggested b1 < -0x3e patch; I don't
see how we can simply replace
((b1 >> 5) == -2) with (b1 < -0x3e).
You must see the b1 < -0x3e in combination with the following b1 <
-0x20. ;-)
But now I have a better "if...else if" switch. :-)
- saves the shift operations
- only 1 comparison per case
- only 1 constant to load per case
- helps compiler to benefit from 1 byte constants and op-codes
- much more readable
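To make the discussion concrete, here is a standalone sketch of such an ascending "if...else if" chain over the signed lead byte, next to the existing "shift" pattern (method names are made up; this illustrates the idea, it is not the actual JDK code). A brute-force check over all 256 byte values confirms the two classifications agree:

```java
public class LeadByteDemo {
    // Sequence length implied by lead byte b1, using an ascending chain
    // of signed-byte comparisons.  Returns 0 for bytes that cannot start
    // a valid sequence.  Bytes are signed in Java, so e.g. 0xC2 is -0x3e.
    static int seqLen(byte b1) {
        if (b1 >= 0)    return 1; // 00..7F: ASCII
        if (b1 < -0x3e) return 0; // 80..C1: continuation or overlong lead
        if (b1 < -0x20) return 2; // C2..DF: 110xxxxx
        if (b1 < -0x10) return 3; // E0..EF: 1110xxxx
        if (b1 < -0x08) return 4; // F0..F7: 11110xxx
        return 0;                 // F8..FF: never valid in UTF-8
    }

    // The same classification with the "shift" pattern; here the overlong
    // leads C0/C1 need the extra (b1 & 0x1e) != 0 test.
    static int seqLenShift(byte b1) {
        if (b1 >= 0)                             return 1;
        if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) return 2;
        if ((b1 >> 4) == -2)                     return 3;
        if ((b1 >> 3) == -2)                     return 4;
        return 0;
    }

    public static void main(String[] args) {
        for (int b = 0; b < 256; b++)
            if (seqLen((byte) b) != seqLenShift((byte) b))
                throw new AssertionError("mismatch at 0x" + Integer.toHexString(b));
        System.out.println("comparison chain and shift pattern agree for all 256 bytes");
    }
}
```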
I believe we changed from (b1 < xyz) to (b1 >> x) == -2 back in
2009(?) because
the benchmark showed the "shift" version to be slightly faster.
IIRC this was only about shifts by multiples of 8, to ensure a 1-byte
comparison of 16/32-bit values in the double/quad-byte charsets.
Do you have any numbers that show any difference now? My
non-scientific benchmark still suggests the "shift" type is faster on
the -server VM, with no significant difference on the -client VM.
------------------ your new switch ---------------
(1) -server
Method                 Millis    Ratio
Decoding 1b UTF-8 :       125    1.000
Decoding 2b UTF-8 :      2558   20.443
Decoding 3b UTF-8 :      3439   27.481
Decoding 4b UTF-8 :      2030   16.221
(2) -client
Method                 Millis    Ratio
Decoding 1b UTF-8 :       335    1.000
Decoding 2b UTF-8 :      1041    3.105
Decoding 3b UTF-8 :      2245    6.694
Decoding 4b UTF-8 :      1254    3.741
------------------ existing "shift" ---------------
(1) -server
Method                 Millis    Ratio
Decoding 1b UTF-8 :       134    1.000
Decoding 2b UTF-8 :      1891   14.106
Decoding 3b UTF-8 :      2934   21.886
Decoding 4b UTF-8 :      2133   15.913
(2) -client
Method                 Millis    Ratio
Decoding 1b UTF-8 :       341    1.000
Decoding 2b UTF-8 :       949    2.560
Decoding 3b UTF-8 :      2321    6.255
Decoding 4b UTF-8 :      1278    3.446
Very interesting and surprising numbers!
The most surprising thing is that the client compiler generates faster
code for the 2..4-byte cases. I think we should ask the HotSpot team
for help. As UTF-8 de/encoding is a very frequent task, HotSpot should
compile the UTF-8 de/encoding code as optimally as possible.
Another surprise is that the benchmark for 1b UTF-8 is not the same
for the "new switch" and the "shift" version, although the ASCII-only
loop is identical in both versions.
To solve the mystery of why the "shift" version is a little faster
than the "new switch" version, it would be helpful to see the
disassembly of the HotSpot-compiled code.
A third version, using the "(b1 & 0xe0) == 0xc0" / "(b1 & 0xf0) ==
0xe0" / "(b1 & 0xf8) == 0xf0" pattern, would be interesting too for
the benchmark comparison.
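Such a mask-based version could be sketched as follows (again a made-up standalone class, not JDK code). Note that the bare mask test (b1 & 0xe0) == 0xc0 accepts the overlong leads C0/C1, which the decoder would still have to reject separately; a brute-force loop confirms the mask and bare shift classifications are otherwise identical:

```java
public class MaskPatternDemo {
    // Mask-and-compare classification of the lead byte.  The & promotes
    // b1 to int with sign extension, but the masks only keep low bits,
    // so the comparisons work as intended.
    static int seqLenMask(byte b1) {
        if ((b1 & 0x80) == 0)    return 1; // 0xxxxxxx
        if ((b1 & 0xe0) == 0xc0) return 2; // 110xxxxx (incl. overlong C0/C1)
        if ((b1 & 0xf0) == 0xe0) return 3; // 1110xxxx
        if ((b1 & 0xf8) == 0xf0) return 4; // 11110xxx
        return 0;
    }

    // The bare "shift" classification, also without the overlong check,
    // so both variants classify C0/C1 as 2-byte leads.
    static int seqLenRawShift(byte b1) {
        if (b1 >= 0)         return 1;
        if ((b1 >> 5) == -2) return 2;
        if ((b1 >> 4) == -2) return 3;
        if ((b1 >> 3) == -2) return 4;
        return 0;
    }

    public static void main(String[] args) {
        for (int b = 0; b < 256; b++)
            if (seqLenMask((byte) b) != seqLenRawShift((byte) b))
                throw new AssertionError("mismatch at 0x" + Integer.toHexString(b));
        System.out.println("mask and shift classification agree for all 256 bytes");
    }
}
```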
In my opinion it would be more meaningful to compare equal numbers of
1..4-byte codes rather than equal numbers of bytes, i.e. 1000 bytes of
1-byte codes against 2000 bytes of 2-byte codes against 3000 bytes of
3-byte codes against 4000 bytes of 4-byte codes.
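Generating such test data is straightforward, since n code points from a k-byte range encode to exactly k*n UTF-8 bytes. A sketch (helper name and sample code points chosen arbitrarily; a real benchmark would then decode each array repeatedly and report time per code rather than per byte):

```java
import java.nio.charset.StandardCharsets;

public class EqualCodeCounts {
    // Build a string of n copies of a single code point and encode it to
    // UTF-8; n code points from a k-byte range yield k*n bytes.
    static byte[] utf8Repeat(int codePoint, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++)
            sb.appendCodePoint(codePoint);
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int n = 1000;
        // one arbitrary representative per UTF-8 sequence length:
        // 'A' (1 byte), U+00E4 (2), U+20AC (3), U+1F600 (4)
        int[] samples = { 'A', 0xE4, 0x20AC, 0x1F600 };
        for (int k = 1; k <= 4; k++) {
            byte[] data = utf8Repeat(samples[k - 1], n);
            System.out.println(k + "-byte codes: " + data.length + " bytes");
        }
    }
}
```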
We should document somewhere that the CESU-8 decoder is faster than
the strict UTF-8 decoder, for developers who can ensure that there are
no invalid surrogates in their source bytes.
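For illustration, the difference in strictness shows up with a surrogate pair encoded as two separate 3-byte sequences, which is valid CESU-8 but ill-formed UTF-8. This sketch assumes a JDK that ships the "CESU-8" charset (class name made up):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cesu8Demo {
    // U+10000 encoded as a surrogate pair, each half as its own 3-byte
    // sequence: valid CESU-8, but ill-formed UTF-8.
    static final byte[] SURROGATE_BYTES = {
        (byte) 0xED, (byte) 0xA0, (byte) 0x80,  // U+D800 (high surrogate)
        (byte) 0xED, (byte) 0xB0, (byte) 0x80   // U+DC00 (low surrogate)
    };

    public static void main(String[] args) {
        // The strict UTF-8 decoder replaces the ill-formed sequences with
        // U+FFFD, so the result does not start with U+10000 ...
        String viaUtf8 = new String(SURROGATE_BYTES, StandardCharsets.UTF_8);
        // ... while the CESU-8 decoder (if available) accepts them.
        String viaCesu8 = new String(SURROGATE_BYTES, Charset.forName("CESU-8"));
        System.out.println(viaUtf8.codePointAt(0) == 0x10000);
        System.out.println(viaCesu8.codePointAt(0) == 0x10000);
    }
}
```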
-Ulf