http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
Go to 3.9 Unicode Encoding Forms. Or simply search D93
On 10/1/2011 2:21 PM, Ulf Zibis wrote:
Am 30.09.2011 22:46, schrieb Xueming Shen:
On 09/30/2011 07:09 AM, Ulf Zibis wrote:
(1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} --->
CoderResult.malformedForLength(1)
It appears the Unicode Standard now explicitly recommends returning
a malformed length of 2 in this scenario, which is what UTF-8 does
now.
My idea behind it was that, in the case of malformed length 1, a
subsequent call to the decode loop would again return another
malformed length 1, to ensure 2 replacement chars in the output
string. (Not sure if that is expected in this corner case.)
The Unicode Standard's "best practices" D93a/b recommend returning 2
in this case.
Can you please give me a link for D93a/b? I don't know where to find it.
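As an aside, the reporting behavior under discussion can be observed directly through the public CharsetDecoder API. Here is a minimal standalone sketch (class and helper names are made up, not JDK-internal code); a JDK that follows the D93a/b maximal-subpart recommendation reports a malformed length of 2 for this input:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class MalformedLengthDemo {
    // Decode the given bytes as UTF-8 and return the first CoderResult.
    // A fresh CharsetDecoder defaults to CodingErrorAction.REPORT, so
    // malformed input is not replaced but reported back to the caller.
    static CoderResult firstResult(byte[] bytes) {
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(bytes.length);
        return StandardCharsets.UTF_8.newDecoder().decode(in, out, true);
    }

    public static void main(String[] args) {
        // 0xE1 0x80 is a truncated 3-byte sequence; 0x42 ('B') is not a
        // continuation byte.  Per D93a/b ("maximal subpart of an
        // ill-formed subsequence") the decoder should report one malformed
        // sequence of length 2, leaving 0x42 to be decoded normally.
        CoderResult cr = firstResult(
                new byte[] { (byte) 0xE1, (byte) 0x80, (byte) 0x42 });
        System.out.println(cr); // MALFORMED[2] on a JDK following D93a/b
    }
}
```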
3. Consider additionally 6795537 - UTF_8$Decoder returns wrong
results <http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537>
I'm not sure I understand the suggested b1 < -0x3e patch; I don't
see how we can simply replace
((b1 >> 5) == -2) with (b1 < -0x3e).
You must see the b1 < -0x3e in combination with the following b1 <
-0x20. ;-)
But now I have a better "if...else if" switch. :-)
- saves the shift operations
- only 1 comparison per case
- only 1 constant to load per case
- helps compiler to benefit from 1 byte constants and op-codes
- much more readable
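To make the discussion concrete, here is a standalone sketch of such an ascending "if...else if" chain over the signed lead byte, next to the existing "shift" pattern (method names are made up; this illustrates the idea, it is not the actual JDK code). A brute-force check over all 256 byte values confirms the two classifications agree:

```java
public class LeadByteDemo {
    // Sequence length implied by lead byte b1, using an ascending chain
    // of signed-byte comparisons.  Returns 0 for bytes that cannot start
    // a valid sequence.  Bytes are signed in Java, so e.g. 0xC2 is -0x3e.
    static int seqLen(byte b1) {
        if (b1 >= 0)    return 1; // 00..7F: ASCII
        if (b1 < -0x3e) return 0; // 80..C1: continuation or overlong lead
        if (b1 < -0x20) return 2; // C2..DF: 110xxxxx
        if (b1 < -0x10) return 3; // E0..EF: 1110xxxx
        if (b1 < -0x08) return 4; // F0..F7: 11110xxx
        return 0;                 // F8..FF: never valid in UTF-8
    }

    // The same classification with the "shift" pattern; here the overlong
    // leads C0/C1 need the extra (b1 & 0x1e) != 0 test.
    static int seqLenShift(byte b1) {
        if (b1 >= 0)                             return 1;
        if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) return 2;
        if ((b1 >> 4) == -2)                     return 3;
        if ((b1 >> 3) == -2)                     return 4;
        return 0;
    }

    public static void main(String[] args) {
        for (int b = 0; b < 256; b++)
            if (seqLen((byte) b) != seqLenShift((byte) b))
                throw new AssertionError("mismatch at 0x" + Integer.toHexString(b));
        System.out.println("comparison chain and shift pattern agree for all 256 bytes");
    }
}
```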
I believe we changed from (b1 < xyz) to (b1 >> x) == -2 back in
2009(?) because
the benchmark showed the "shift" version to be slightly faster.
IIRC this was only about shifts by multiples of 8, to ensure a 1-byte
comparison of 16/32-bit values in the double/quad-byte charsets.
Do you have any numbers that show any difference now? My
non-scientific benchmark still suggests the "shift" type is faster on
the -server VM, with no significant difference on the -client VM.
------------------ your new switch ---------------
(1) -server
Method                 Millis    Ratio
Decoding 1b UTF-8 :       125    1.000
Decoding 2b UTF-8 :      2558   20.443
Decoding 3b UTF-8 :      3439   27.481
Decoding 4b UTF-8 :      2030   16.221
(2) -client
Method                 Millis    Ratio
Decoding 1b UTF-8 :       335    1.000
Decoding 2b UTF-8 :      1041    3.105
Decoding 3b UTF-8 :      2245    6.694
Decoding 4b UTF-8 :      1254    3.741
------------------ existing "shift" ---------------
(1) -server
Method                 Millis    Ratio
Decoding 1b UTF-8 :       134    1.000
Decoding 2b UTF-8 :      1891   14.106
Decoding 3b UTF-8 :      2934   21.886
Decoding 4b UTF-8 :      2133   15.913
(2) -client
Method                 Millis    Ratio
Decoding 1b UTF-8 :       341    1.000
Decoding 2b UTF-8 :       949    2.560
Decoding 3b UTF-8 :      2321    6.255
Decoding 4b UTF-8 :      1278    3.446
Very interesting and surprising numbers!
The most surprising thing is that the client compiler generates faster
code for the 2..4-byte cases. I think we should ask the HotSpot team
for help. As UTF-8 de/encoding is a very frequent task, HotSpot should
compile the UTF-8 de/encoding code as optimally as possible.
Another surprise is that the benchmark for 1b UTF-8 is not the same
for the "new switch" and the "shift" version, although the ASCII-only
loop is identical in both versions.
To solve the mystery of why the "shift" version is a little faster
than the "new switch" version, it would be helpful to see the
disassembly of the HotSpot-compiled code.
A third version, using the "(b1 & 0xe0) == 0xc0" / "(b1 & 0xf0) ==
0xe0" / "(b1 & 0xf8) == 0xf0" pattern, would be interesting too for
the benchmark comparison.
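Such a mask-based version could be sketched as follows (again a made-up standalone class, not JDK code). Note that the bare mask test (b1 & 0xe0) == 0xc0 accepts the overlong leads C0/C1, which the decoder would still have to reject separately; a brute-force loop confirms the mask and bare shift classifications are otherwise identical:

```java
public class MaskPatternDemo {
    // Mask-and-compare classification of the lead byte.  The & promotes
    // b1 to int with sign extension, but the masks only keep low bits,
    // so the comparisons work as intended.
    static int seqLenMask(byte b1) {
        if ((b1 & 0x80) == 0)    return 1; // 0xxxxxxx
        if ((b1 & 0xe0) == 0xc0) return 2; // 110xxxxx (incl. overlong C0/C1)
        if ((b1 & 0xf0) == 0xe0) return 3; // 1110xxxx
        if ((b1 & 0xf8) == 0xf0) return 4; // 11110xxx
        return 0;
    }

    // The bare "shift" classification, also without the overlong check,
    // so both variants classify C0/C1 as 2-byte leads.
    static int seqLenRawShift(byte b1) {
        if (b1 >= 0)         return 1;
        if ((b1 >> 5) == -2) return 2;
        if ((b1 >> 4) == -2) return 3;
        if ((b1 >> 3) == -2) return 4;
        return 0;
    }

    public static void main(String[] args) {
        for (int b = 0; b < 256; b++)
            if (seqLenMask((byte) b) != seqLenRawShift((byte) b))
                throw new AssertionError("mismatch at 0x" + Integer.toHexString(b));
        System.out.println("mask and shift classification agree for all 256 bytes");
    }
}
```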
In my opinion it would be more meaningful to compare equal numbers of
1..4-byte codes rather than equal numbers of bytes, i.e. 1000 bytes of
1-byte codes against 2000 bytes of 2-byte codes against 3000 bytes of
3-byte codes against 4000 bytes of 4-byte codes.
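Generating such test data is straightforward, since n code points from a k-byte range encode to exactly k*n UTF-8 bytes. A sketch (helper name and sample code points chosen arbitrarily; a real benchmark would then decode each array repeatedly and report time per code rather than per byte):

```java
import java.nio.charset.StandardCharsets;

public class EqualCodeCounts {
    // Build a string of n copies of a single code point and encode it to
    // UTF-8; n code points from a k-byte range yield k*n bytes.
    static byte[] utf8Repeat(int codePoint, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++)
            sb.appendCodePoint(codePoint);
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int n = 1000;
        // one arbitrary representative per UTF-8 sequence length:
        // 'A' (1 byte), U+00E4 (2), U+20AC (3), U+1F600 (4)
        int[] samples = { 'A', 0xE4, 0x20AC, 0x1F600 };
        for (int k = 1; k <= 4; k++) {
            byte[] data = utf8Repeat(samples[k - 1], n);
            System.out.println(k + "-byte codes: " + data.length + " bytes");
        }
    }
}
```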
We should document somewhere that the CESU-8 decoder is faster than
the strict UTF-8 decoder, for developers who can ensure that there are
no invalid surrogates in their source bytes.
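For illustration, the difference in strictness shows up with a surrogate pair encoded as two separate 3-byte sequences, which is valid CESU-8 but ill-formed UTF-8. This sketch assumes a JDK that ships the "CESU-8" charset (class name made up):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cesu8Demo {
    // U+10000 encoded as a surrogate pair, each half as its own 3-byte
    // sequence: valid CESU-8, but ill-formed UTF-8.
    static final byte[] SURROGATE_BYTES = {
        (byte) 0xED, (byte) 0xA0, (byte) 0x80,  // U+D800 (high surrogate)
        (byte) 0xED, (byte) 0xB0, (byte) 0x80   // U+DC00 (low surrogate)
    };

    public static void main(String[] args) {
        // The strict UTF-8 decoder replaces the ill-formed sequences with
        // U+FFFD, so the result does not start with U+10000 ...
        String viaUtf8 = new String(SURROGATE_BYTES, StandardCharsets.UTF_8);
        // ... while the CESU-8 decoder (if available) accepts them.
        String viaCesu8 = new String(SURROGATE_BYTES, Charset.forName("CESU-8"));
        System.out.println(viaUtf8.codePointAt(0) == 0x10000);
        System.out.println(viaCesu8.codePointAt(0) == 0x10000);
    }
}
```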
-Ulf