Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-11-07 Thread Ulf Zibis

Hi Sherman,

2 nits again:
protected static final class Encoder extends CharsetEncoder
- If final, protected doesn't make sense, as it can't be subclassed. (ideally this should be a 
compiler error)

- Where is it used? Otherwise it could be private.

-Ulf


Am 07.11.2011 21:30, schrieb Xueming Shen:

Thanks Alan,

The webrev has been updated accordingly to address your comments (the
commented out isMalformed2 has been removed, copyright updated, bugid
added...)

And thanks for Ulf for the detailed review.

-Sherman

On 10/28/2011 09:14 AM, Alan Bateman wrote:

On 28/09/2011 20:18, Xueming Shen wrote:

Hi,

[I combined the proposed charge for #7082884, in which no one appears to be
interested:-) into this one]
:

http://cr.openjdk.java.net/~sherman/7096080/webrev/ 
http://cr.openjdk.java.net/%7Esherman/7096080/webrev/
I don't know if you are still looking for a reviewer for this (seems like Ulf has gone through 
this in detail, thanks Ulf).


Overall it looks fine to me. Minor comment is that in UTF_8.java then maybe isMalformed2 should 
be removed completely, maybe move some of the comment in the decode methods. Another minor nits 
is that the date on CESU_8.java is 2000-2010 where I assume it should be 2011. In Errors.java 
then it might be better to just remove L196 as it might confuse future maintainers. I would also 
suggest adding the bugID to the list of bugs in the tests too as someone these references are 
useful.


-Alan.





Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-11-07 Thread Xueming Shen

Hi Ulf,

Good catch. It was the leftover of my first drift, in which 
CESU.de/encoder is the

subclass of the UTF8 de/encoder.

webrev has been updated accordingly.

-Sherman

On 11/07/2011 12:57 PM, Ulf Zibis wrote:

Hi Sherman,

2 nits again:
protected static final class Encoder extends CharsetEncoder
- If final, protected doesn't make sense, as it can't be subclassed. 
(ideally this should be a compiler error)

- Where is it used? Otherwise it could be private.

-Ulf


Am 07.11.2011 21:30, schrieb Xueming Shen:

Thanks Alan,

The webrev has been updated accordingly to address your comments (the
commented out isMalformed2 has been removed, copyright updated, bugid
added...)

And thanks for Ulf for the detailed review.

-Sherman

On 10/28/2011 09:14 AM, Alan Bateman wrote:

On 28/09/2011 20:18, Xueming Shen wrote:

Hi,

[I combined the proposed charge for #7082884, in which no one 
appears to be

interested:-) into this one]
:

http://cr.openjdk.java.net/~sherman/7096080/webrev/ 
http://cr.openjdk.java.net/%7Esherman/7096080/webrev/
I don't know if you are still looking for a reviewer for this (seems 
like Ulf has gone through this in detail, thanks Ulf).


Overall it looks fine to me. Minor comment is that in UTF_8.java 
then maybe isMalformed2 should be removed completely, maybe move 
some of the comment in the decode methods. Another minor nits is 
that the date on CESU_8.java is 2000-2010 where I assume it should 
be 2011. In Errors.java then it might be better to just remove L196 
as it might confuse future maintainers. I would also suggest adding 
the bugID to the list of bugs in the tests too as someone these 
references are useful.


-Alan.







Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-28 Thread Alan Bateman

On 28/09/2011 20:18, Xueming Shen wrote:

Hi,

[I combined the proposed charge for #7082884, in which no one appears 
to be

interested:-) into this one]
:

http://cr.openjdk.java.net/~sherman/7096080/webrev/ 
http://cr.openjdk.java.net/%7Esherman/7096080/webrev/
I don't know if you are still looking for a reviewer for this (seems 
like Ulf has gone through this in detail, thanks Ulf).


Overall it looks fine to me. Minor comment is that in UTF_8.java then 
maybe isMalformed2 should be removed completely, maybe move some of the 
comment in the decode methods. Another minor nits is that the date on 
CESU_8.java is 2000-2010 where I assume it should be 2011. In 
Errors.java then it might be better to just remove L196 as it might 
confuse future maintainers. I would also suggest adding the bugID to the 
list of bugs in the tests too as someone these references are useful.


-Alan.


Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-14 Thread Ulf Zibis

Am 13.10.2011 21:13, schrieb Xueming Shen:

On 10/13/2011 09:55 AM, Ulf Zibis wrote:

Am 11.10.2011 19:49, schrieb Xueming Shen:


I don't know which one is better, I did a run on

private static boolean op1(int b) {
return (b  6) != -2;
}
private static boolean op2(int b) {
return (b  0xc0) != 0x80;
}
private static boolean op3(byte b) {
 return b = (byte)0xc0;
}

with 100 iteration on my linux machine,  and got the scores

op1=1149
op2=1147
op3=1146

I would interpret it as they are identical.

Me too. thanks for your effort.
Maybe the comparison would differ on different architectures.

So I would prefer opt3, because the others ...
1. in question need 1 more CPU register to save the original value of b for 
later usage
2. need 1 more constant to load into CPU
and opt 3 ...
3. is the best readable source code
4. in question seems best to help Hotspot finding best optimization on 
arbitrary architectures.

5. is the smallest in bytecode footprint
6. so interpreter would be faster too.


I doubt it's more readable:-), given it's the byte operation means
0x80  = 0xc0 in int.

If b would be an unsigned int in range [0..0xFF], half yes (it would be: b0x80 || 
b=0xc0).
But b is in range [-0x80..0x7F] due to it's origin from a byte array, so the operation translated to 
int would be: b  -0x80 || b = -0x40



You need b to be byte for b = (byte)0xc0
No, it works as same for int, because the lower limit -0x80 will never be exceeded and (byte)0xc0 is 
-0x40.

So the notation b = (byte)0xc0 looks most close to its real unsigned 
counterpart.


to be the equivalent of 0x80  = 0xc0 and all use cases in UTF-8
existing implementation the b has been stored in int already.  Arguably
you can update the whole implementation to achieve this,

yes, that's exactly what I wanted to say.


but personally
I would like to just stick to the problem this proposal is trying to solve.

I agree, but it's not much more than declaring the bx as byte.



And, no, for the same reason I don't want to replace all (b  0xc0) != 0x80
by isNotContinuation(b), they just look fine for me, together with their
neighbors, such as 0x80  = 0xc0.
Yes, they look fine, but the reader always must put in mind, that (b  0xc0) != 0x80 is 
semantically same than isNotContinuation(b).
Why you introduce isNotContinuation(b) at all? It could always be inlined, as I don't think, the 
tiny operation has any effect on HotSpot's optimization strategy, and as a side effect, I guess C1 
code would be faster.


-Ulf




-Sherman



Additionally I guess using always byte variables would in question help HotSpot to optimize with 
1-byte-operand CPU instructions.


Don't you like to replace all (bx  0xc0) != 0x80 by isNotContinuation(bx) ?

-Ulf





Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-14 Thread Ulf Zibis

Am 30.09.2011 22:46, schrieb Xueming Shen:

I believe we changed from (b1  xyz) to (b1  x) == -2 back to 2009(?) because
the benchmark shows the shift version is slightly faster. Do you have any 
number
shows any difference now. My non-scientific benchmark still suggests the shift
type is faster on -server vm, no significant difference on -client vm.

My new guess for the reason:
The unfolding of the bytes to int to serve the isNotContinuation / 
isMalformedxx methods.
So those methods should be coded in byte logic too.

But there remains the big question, why c1 is faster than c2, except for 1b.

-Ulf




  --  your new switch---
(1) -server
Method  Millis  Ratio
Decoding 1b UTF-8 :125  1.000
Decoding 2b UTF-8 :   2558 20.443
Decoding 3b UTF-8 :   3439 27.481
Decoding 4b UTF-8 :   2030 16.221
(2) -client
Decoding 1b UTF-8 :335  1.000
Decoding 2b UTF-8 :   1041  3.105
Decoding 3b UTF-8 :   2245  6.694
Decoding 4b UTF-8 :   1254  3.741

  -- existing shift---
(1) -server
Decoding 1b UTF-8 :134  1.000
Decoding 2b UTF-8 :   1891 14.106
Decoding 3b UTF-8 :   2934 21.886
Decoding 4b UTF-8 :   2133 15.913
(2) -client
Decoding 1b UTF-8 :341  1.000
Decoding 2b UTF-8 :949  2.560
Decoding 3b UTF-8 :   2321  6.255
Decoding 4b UTF-8 :   1278  3.446



-sherman



Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-14 Thread Ulf Zibis

Am 14.10.2011 10:47, schrieb Ulf Zibis:

My new guess for the reason:
The unfolding of the bytes to int to serve the isNotContinuation / 
isMalformedxx methods.
So those methods should be coded in byte logic too.

+ use the bx = (byte)abc logic instead shift or (bx  abc) != def.

-Ulf



Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-13 Thread Ulf Zibis

Am 11.10.2011 19:49, schrieb Xueming Shen:


I don't know which one is better, I did a run on

private static boolean op1(int b) {
return (b  6) != -2;
}
private static boolean op2(int b) {
return (b  0xc0) != 0x80;
}
private static boolean op3(byte b) {
 return b = (byte)0xc0;
}

with 100 iteration on my linux machine,  and got the scores

op1=1149
op2=1147
op3=1146

I would interpret it as they are identical.

Me too. thanks for your effort.
Maybe the comparison would differ on different architectures.

So I would prefer opt3, because the others ...
1. in question need 1 more CPU register to save the original value of b for 
later usage
2. need 1 more constant to load into CPU
and opt 3 ...
3. is the best readable source code
4. in question seems best to help Hotspot finding best optimization on 
arbitrary architectures.

Additionally I guess using always byte variables would in question help HotSpot to optimize with 
1-byte-operand CPU instructions.


Don't you like to replace all (bx  0xc0) != 0x80 by isNotContinuation(bx) ?

-Ulf



Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-13 Thread Xueming Shen

On 10/13/2011 09:55 AM, Ulf Zibis wrote:

Am 11.10.2011 19:49, schrieb Xueming Shen:


I don't know which one is better, I did a run on

private static boolean op1(int b) {
return (b  6) != -2;
}
private static boolean op2(int b) {
return (b  0xc0) != 0x80;
}
private static boolean op3(byte b) {
 return b = (byte)0xc0;
}

with 100 iteration on my linux machine,  and got the scores

op1=1149
op2=1147
op3=1146

I would interpret it as they are identical.

Me too. thanks for your effort.
Maybe the comparison would differ on different architectures.

So I would prefer opt3, because the others ...
1. in question need 1 more CPU register to save the original value of 
b for later usage

2. need 1 more constant to load into CPU
and opt 3 ...
3. is the best readable source code
4. in question seems best to help Hotspot finding best optimization on 
arbitrary architectures.


I doubt it's more readable:-), given it's the byte operation means
0x80  = 0xc0 in int. You need b to be byte for b = (byte)0xc0
to be the equivalent of 0x80  = 0xc0 and all use cases in UTF-8
existing implementation the b has been stored in int already.  Arguably
you can update the whole implementation to achieve this, but personally
I would like to just stick to the problem this proposal is trying to solve.

And, no, for the same reason I don't want to replace all (b  0xc0) != 0x80
by isNotContinuation(b), they just look fine for me, together with their
neighbors, such as 0x80  = 0xc0.

-Sherman



Additionally I guess using always byte variables would in question 
help HotSpot to optimize with 1-byte-operand CPU instructions.


Don't you like to replace all (bx  0xc0) != 0x80 by 
isNotContinuation(bx) ?


-Ulf





Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-11 Thread Ulf Zibis

Hi Sherman,

I didn't read anything from you since longer time. You are in holidays?

Am 30.09.2011 22:46, schrieb Xueming Shen:


I believe we changed from (b1  xyz) to (b1  x) == -2 back to 2009(?) because
the benchmark shows the shift version is slightly faster. Do you have any 
number
shows any difference now. My non-scientific benchmark still suggests the shift
type is faster on -server vm, no significant difference on -client vm.


In this sense, then you should do the same here:
  87 private static boolean isNotContinuation(int b) {
  88 return (b  6) != -2;
  89 }

... + in all isMalformedxxx().

BTW, in all isMalformedxxx() you could replace all
(bx  0xc0) != 0x80
by
isNotContinuation(bx)
(would reduce the effort, checking the hex values manually, each time while 
reading)


Additionally:
Make private:
  75 private static final void updatePositions(
  76 Buffer src, int sp, Buffer dst, int dp) {


-Ulf




Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-11 Thread Ulf Zibis

Am 11.10.2011 13:36, schrieb Ulf Zibis:



I believe we changed from (b1  xyz) to (b1  x) == -2 back to 2009(?) because
the benchmark shows the shift version is slightly faster. Do you have any 
number
shows any difference now. My non-scientific benchmark still suggests the shift
type is faster on -server vm, no significant difference on -client vm.



In this sense, then you should do the same here:
  87 private static boolean isNotContinuation(int b) {
  88 return (b  6) != -2;
  89 }


Another 3rd variant to compare for performance:
(needs only 1 constant to load in cpu register, and is a 1-byte op code)

  87 private static boolean isNotContinuation(byte b) {
  88 return b = (byte)0xc0;
  89 }

-Ulf



Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-11 Thread Xueming Shen

On 10/11/2011 04:36 AM, Ulf Zibis wrote:

Am 30.09.2011 22:46, schrieb Xueming Shen:


I believe we changed from (b1  xyz) to (b1  x) == -2 back to 
2009(?) because
the benchmark shows the shift version is slightly faster. Do you 
have any number
shows any difference now. My non-scientific benchmark still suggests 
the shift

type is faster on -server vm, no significant difference on -client vm.


In this sense, then you should do the same here:
  87 private static boolean isNotContinuation(int b) {
  88 return (b  6) != -2;
  89 }


I don't know which one is better, I did a run on

private static boolean op1(int b) {
return (b  6) != -2;
}
private static boolean op2(int b) {
return (b  0xc0) != 0x80;
}
private static boolean op3(byte b) {
 return b = (byte)0xc0;
}

with 100 iteration on my linux machine,  and got the scores

op1=1149
op2=1147
op3=1146

I would interpret it as they are identical.




Additionally:
Make private:
  75 private static final void updatePositions(
  76 Buffer src, int sp, Buffer dst, int dp) {


updated accordingly

-Sherman


Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-02 Thread Xueming Shen

http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

Go to 3.9 Unicode Encoding Forms. Or simply search D93

On 10/1/2011 2:21 PM, Ulf Zibis wrote:

Am 30.09.2011 22:46, schrieb Xueming Shen:

On 09/30/2011 07:09 AM, Ulf Zibis wrote:


(1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} --- 
CoderResult.malformedForLength(1)
It appears the Unicode Standard now explicitly recommends to return 
the malformed length 2,

what UTF-8 is doing now, for this scenario
My idea behind was, that in case of malformed length 1 a consecutive 
call to the decode loop would again return another malformed length 
1, to ensure 2 replacement chars in the output string. (Not sure, if 
that is expected in this corner case.)


Unicode Standard's best practices D93a/b recommends to return 2 in 
this case.

Can you please give me a link for D93a/a. I don't know, where to find it.





3. Consider additionally 6795537 - UTF_8$Decoder returns wrong 
results http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537



I'm not sure I  understand the suggested  b1  -0x3e patch, I don't 
see we can simply replace

((b1  5) == -2) with (b1  -0x3e).
You must see the b1  -0x3e in combination with the following b1  
-0x20. ;-)


But now I have a better if...else if switch. :-)
- saves the shift operations
- only 1 comparison per case
- only 1 constant to load per case
- helps compiler to benefit from 1 byte constants and op-codes
- much better readable


I believe we changed from (b1  xyz) to (b1  x) == -2 back to 
2009(?) because

the benchmark shows the shift version is slightly faster.
IIRC this was only about a shift by multiples of 8 to ensure an 1-byte 
comparison of 16/32-byte values in the double/quad-byte charsets.




Do you have any number
shows any difference now. My non-scientific benchmark still suggests 
the shift

type is faster on -server vm, no significant difference on -client vm.

  --  your new switch---
(1) -server
Method  Millis  Ratio
Decoding 1b UTF-8 :125  1.000
Decoding 2b UTF-8 :   2558 20.443
Decoding 3b UTF-8 :   3439 27.481
Decoding 4b UTF-8 :   2030 16.221
(2) -client
Decoding 1b UTF-8 :335  1.000
Decoding 2b UTF-8 :   1041  3.105
Decoding 3b UTF-8 :   2245  6.694
Decoding 4b UTF-8 :   1254  3.741

  -- existing shift---
(1) -server
Decoding 1b UTF-8 :134  1.000
Decoding 2b UTF-8 :   1891 14.106
Decoding 3b UTF-8 :   2934 21.886
Decoding 4b UTF-8 :   2133 15.913
(2) -client
Decoding 1b UTF-8 :341  1.000
Decoding 2b UTF-8 :949  2.560
Decoding 3b UTF-8 :   2321  6.255
Decoding 4b UTF-8 :   1278  3.446


Very interesting and surprising numbers!
The most surprising is, that the client compiler generates faster code 
for 2..4-byte codes. I think, we should ask the HotSpot team for help. 
As the UTF-8 de/encoding is a very frequent task, HotSpot should 
provide compiled code as optimized best as possible for UTF-8 de/encoding.
Another surprise is, that benchmark for 1b UTF-8 is not same for new 
switch and shift version, as the ASCII only loop is the same in 
both versions.
To discover the miracle, why theshift version is little faster than 
the new switch version, it should be helpful, to see the 
disassembling of the HotSpot compiled code.
A third version, using the (b1  0xe0) == 0xc0/(b1  0xf0) == 
0xe0/(b1  0xf8) == 0xf0 pattern, should be interesting toofor the 
benchmark comparison.


In my opinion it would be more significant to compare x 1..4-byte 
codes than y bytes of 1..4-byte codes. I.e. 1000 bytes of 1-byte codes 
against 2000 bytes of 2-byte codes against 3000 bytes of 3-byte codes 
against 4000 bytes of 4-byte codes


We should document somewhere, that the ESU-8 decoder is faster than 
the strong UTF-8 decoder for developers, who can ensure, that there 
are no invalid surrogates in their source bytes.


-Ulf



Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-02 Thread Ulf Zibis

Am 02.10.2011 08:29, schrieb Xueming Shen:

http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

Go to 3.9 Unicode Encoding Forms. Or simply search D93

On 10/1/2011 2:21 PM, Ulf Zibis wrote:

Am 30.09.2011 22:46, schrieb Xueming Shen:

On 09/30/2011 07:09 AM, Ulf Zibis wrote:


(1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} --- 
CoderResult.malformedForLength(1)
It appears the Unicode Standard now explicitly recommends to return the 
malformed length 2,
what UTF-8 is doing now, for this scenario
My idea behind was, that in case of malformed length 1 a consecutive call to the decode loop 
would again return another malformed length 1, to ensure 2 replacement chars in the output 
string. (Not sure, if that is expected in this corner case.)


Unicode Standard's best practices D93a/b recommends to return 2 in this case.

OK, I got it:
E1 80 42 -- malformed length 2 -- 1 replacement -- FFFD 0042

Because for later understanding by others it could be difficult to find the right documents, it 
would be *very nice* to add this link to the souce code of UTF_8.java, by javadoc, or by simple doc.


-Ulf



Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-02 Thread Ulf Zibis

Hi again,

Am 30.09.2011 00:27, schrieb Xueming Shen:

On 09/29/2011 02:16 PM, Ulf Zibis wrote:


 280 if (Character.isSurrogate(c))
 281 return malformedForLength(src, sp, dst, dp, 3);
Shouldn't we return cr.length() = 1to allow remaining 2 bytes to be interpreted 
again ?

Forget it! If c is a surrogate, b2 is in range A0..BF and b3 is in range 80..BF. Both can not be 
potentially well-formed as a first byte.




Actually I don't know the answer. My reading of D93a/D93b suggests that we might
interpret it as a whole, given the bytes are actually in well-formed byte 
pattern range
listed in Table 3.7, but ill-formed simply because they are surrogate value 
not scale
value, so I would interpret the whole 3 bytes as a maximal subpart. Given 
D93a/b is
best practices for Using U+fffd, either way should be fine. We do have 
Unicode expert
on the list, so maybe they can share their opinion on what is the 
desired/recommended
behavior in this case, from Standard point view?


At line 102 you could insert:
//  [E0] [A0..BF]
//  [E1..EF] [80..BF]

-Ulf



Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-10-01 Thread Ulf Zibis

Am 30.09.2011 22:46, schrieb Xueming Shen:

On 09/30/2011 07:09 AM, Ulf Zibis wrote:


(1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} --- 
CoderResult.malformedForLength(1)
It appears the Unicode Standard now explicitly recommends to return the 
malformed length 2,
what UTF-8 is doing now, for this scenario
My idea behind was, that in case of malformed length 1 a consecutive call to the decode loop 
would again return another malformed length 1, to ensure 2 replacement chars in the output 
string. (Not sure, if that is expected in this corner case.)


Unicode Standard's best practices D93a/b recommends to return 2 in this case.

Can you please give me a link for D93a/a. I don't know, where to find it.





3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results 
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537




I'm not sure I  understand the suggested  b1  -0x3e patch, I don't see we can 
simply replace
((b1  5) == -2) with (b1  -0x3e).

You must see the b1  -0x3e in combination with the following b1  -0x20. ;-)

But now I have a better if...else if switch. :-)
- saves the shift operations
- only 1 comparison per case
- only 1 constant to load per case
- helps compiler to benefit from 1 byte constants and op-codes
- much better readable


I believe we changed from (b1  xyz) to (b1  x) == -2 back to 2009(?) because
the benchmark shows the shift version is slightly faster.
IIRC this was only about a shift by multiples of 8 to ensure an 1-byte comparison of 16/32-byte 
values in the double/quad-byte charsets.




Do you have any number
shows any difference now. My non-scientific benchmark still suggests the shift
type is faster on -server vm, no significant difference on -client vm.

  --  your new switch---
(1) -server
Method  Millis  Ratio
Decoding 1b UTF-8 :125  1.000
Decoding 2b UTF-8 :   2558 20.443
Decoding 3b UTF-8 :   3439 27.481
Decoding 4b UTF-8 :   2030 16.221
(2) -client
Decoding 1b UTF-8 :335  1.000
Decoding 2b UTF-8 :   1041  3.105
Decoding 3b UTF-8 :   2245  6.694
Decoding 4b UTF-8 :   1254  3.741

  -- existing shift---
(1) -server
Decoding 1b UTF-8 :134  1.000
Decoding 2b UTF-8 :   1891 14.106
Decoding 3b UTF-8 :   2934 21.886
Decoding 4b UTF-8 :   2133 15.913
(2) -client
Decoding 1b UTF-8 :341  1.000
Decoding 2b UTF-8 :949  2.560
Decoding 3b UTF-8 :   2321  6.255
Decoding 4b UTF-8 :   1278  3.446


Very interesting and surprising numbers!
The most surprising is, that the client compiler generates faster code for 2..4-byte codes. I think, 
we should ask the HotSpot team for help. As the UTF-8 de/encoding is a very frequent task, HotSpot 
should provide compiled code as optimized best as possible for UTF-8 de/encoding.
Another surprise is, that benchmark for 1b UTF-8 is not same for new switch and shift version, 
as the ASCII only loop is the same in both versions.
To discover the miracle, why theshift version is little faster than the new switch version, it 
should be helpful, to see the disassembling of the HotSpot compiled code.
A third version, using the (b1  0xe0) == 0xc0/(b1  0xf0) == 0xe0/(b1  0xf8) == 0xf0 
pattern, should be interesting toofor the benchmark comparison.


In my opinion it would be more significant to compare x 1..4-byte codes than y bytes of 1..4-byte 
codes. I.e. 1000 bytes of 1-byte codes against 2000 bytes of 2-byte codes against 3000 bytes of 
3-byte codes against 4000 bytes of 4-byte codes


We should document somewhere, that the ESU-8 decoder is faster than the strong UTF-8 decoder for 
developers, who can ensure, that there are no invalid surrogates in their source bytes.


-Ulf



Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-09-30 Thread Ulf Zibis

Hi,

Am 29.09.2011 05:27, schrieb Xueming Shen:

Hi,

On 9/28/2011 3:44 PM, Ulf Zibis wrote
3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results 
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537




(1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} --- 
CoderResult.malformedForLength(1)
It appears the Unicode Standard now explicitly recommends to return the 
malformed length 2,
what UTF-8 is doing now, for this scenario
My idea behind is, that in case of malformed length 1 a consecutive call to the decode loop would 
again return another malformed length 1, to ensure 2 replacement chars in the output string. (Not 
sure, if that is expected in this corner case.)




I'm not sure I  understand the suggested  b1  -0x3e patch, I don't see we can 
simply replace
((b1  5) == -2) with (b1  -0x3e).

You must see the b1  -0x3e in combination with the following b1  -0x20. ;-)

But now I have a better if...else if switch. :-)
- saves the shift operations
- only 1 comparison per case
- only 1 constant to load per case
- helps compiler to benefit from 1 byte constants and op-codes
- much better readable

byte b1 = sa[sp]; // help compiler to benefit from 1 byte 
op-codes and constants
//byte b1 = src.get();// help compiler to benefit from 1 byte 
op-codes and constants
Byte1Switch:if (b1 = 0) {
// 1 byte, 7 bits: 0xxx
...return x;
} else if (b1  (byte)0xe0) {
// 2 bytes, 11 bits: 110x 10xx
if (b1  (byte)0xc2) // b1  C2 not legal
break Byte1Switch;
...return x;
} else if (b1  (byte)0xf0) {
// 3 bytes, 16 bits: 1110 10xx 10xx
...return x;
} else if (b1  (byte)0xf8) {
// 4 bytes, 21 bits: 0xxx 10xx 10xx 10xx
...return x;
}
return malformed(src, sp, dst, dp, 1);
//return malformed(src, mark, 1);



Anyway, I hope now you are motivated to take a deep look at the code:-) and 
maybe want to
run all your tests to confirm the change is fine.


At the moment I don't have a well running system, so my contribution must 
remain limited. :-(

About motivation:
For me it's kinda frustrating, seeing a bug from external voluntary contributor as Will Not Fix, 
but some time later an Oracle employée creates an equivalent new bug from scratch without at least 
referring to the existing ones, e.g.: 6798514, 6795537.


Additionally I think, now it's the right time to re-evaluate bug 4508058 - UTF-8 encoding does not 
recognize initial BOM.


Cheers,

-Ulf




Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-09-30 Thread Xueming Shen

On 09/30/2011 07:09 AM, Ulf Zibis wrote:




(1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} --- 
CoderResult.malformedForLength(1)
It appears the Unicode Standard now explicitly recommends to return 
the malformed length 2,

what UTF-8 is doing now, for this scenario
My idea behind is, that in case of malformed length 1 a consecutive 
call to the decode loop would again return another malformed length 1, 
to ensure 2 replacement chars in the output string. (Not sure, if that 
is expected in this corner case.)


Unicode Standard's best practices D93a/b recommends to return 2 in 
this case.



3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results 
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537



I'm not sure I  understand the suggested  b1  -0x3e patch, I don't 
see we can simply replace

((b1  5) == -2) with (b1  -0x3e).
You must see the b1  -0x3e in combination with the following b1  
-0x20. ;-)


But now I have a better if...else if switch. :-)
- saves the shift operations
- only 1 comparison per case
- only 1 constant to load per case
- helps compiler to benefit from 1 byte constants and op-codes
- much better readable


I believe we changed from (b1  xyz) to (b1  x) == -2 back to 2009(?) 
because
the benchmark shows the shift version is slightly faster. Do you have 
any number
shows any difference now. My non-scientific benchmark still suggests the 
shift

type is faster on -server vm, no significant difference on -client vm.

  --  your new switch---
(1) -server
Method  Millis  Ratio
Decoding 1b UTF-8 :125  1.000
Decoding 2b UTF-8 :   2558 20.443
Decoding 3b UTF-8 :   3439 27.481
Decoding 4b UTF-8 :   2030 16.221
(2) -client
Decoding 1b UTF-8 :335  1.000
Decoding 2b UTF-8 :   1041  3.105
Decoding 3b UTF-8 :   2245  6.694
Decoding 4b UTF-8 :   1254  3.741

  -- existing shift---
(1) -server
Decoding 1b UTF-8 :134  1.000
Decoding 2b UTF-8 :   1891 14.106
Decoding 3b UTF-8 :   2934 21.886
Decoding 4b UTF-8 :   2133 15.913
(2) -client
Decoding 1b UTF-8 :341  1.000
Decoding 2b UTF-8 :949  2.560
Decoding 3b UTF-8 :   2321  6.255
Decoding 4b UTF-8 :   1278  3.446



-sherman



Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-09-29 Thread Xueming Shen

On 09/29/2011 02:16 PM, Ulf Zibis wrote:

Please use spaces with ternary operators: Lines 155, 216

For short you could use sr instead srcRemaining, consistent to sa, sp, sl.

 420 // returns -1 if there is malformed byte(s) and the
better:
 420 // returns -1 if there is/are malformed byte(s) and the

 466 sp -=3;
There should be a space:  sp -= 3;


Webrev has been updated accordingly.



 280 if (Character.isSurrogate(c))
 281 return malformedForLength(src, sp, dst, 
dp, 3);
Shouldn't we return cr.length() = 1to allow remaining 2 bytes to be 
interpreted again ?




Actually I don't know the answer. My reading of D93a/D93b suggests that 
we might
interpret it as a whole, given the bytes are actually in well-formed 
byte pattern range
listed in Table 3.7, but ill-formed simply because they are surrogate 
value not scale
value, so I would interpret the whole 3 bytes as a maximal subpart. 
Given D93a/b is
best practices for Using U+fffd, either way should be fine. We do have 
Unicode expert
on the list, so maybe they can share their opinion on what is the 
desired/recommended

behavior in this case, from Standard point view?



Am 29.09.2011 05:27, schrieb Xueming Shen:

Hi,

On 9/28/2011 3:44 PM, Ulf Zibis wrote:
5. IMHO charset CESU-8 should be hosted in extended-charsets, 
otherwise it should be added to java.nio.StandardCharsets




We have lots of charsets provided via the standard charset provider 
(in rt.jar) but not listed in StandardCharsets.
Yes, but the reasonable to add CESU-8 to StandardCharsets was the 
supposed demand to treat all unicode charsets equivalent.


Otherwise there is no obstacle to host CESU-8 in extended-charsets.
IMHO, CESU-8 addresses corner case compatibility issues, but not 
standard requirements.


To put CESU-8 into standard charset provider (it is only an 
implementation details) does
not mean it is a standard requirement, it just means it is bundled 
into rt.jar. The reason
I put it there is to make sure it is together with the UTF-8, with the 
assumption is that you
might need it around when using the updated UTF-8, which no longer 
handles those 3/6-byte

surrogates.

-Sherman




Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-09-28 Thread Xueming Shen

Hi,

[I combined the proposed charge for #7082884, in which no one appears to be
interested:-) into this one]

Unicode Standard added Addition Constraints on conversion of ill-formed 
UTF-8
in version 5.1 [1] and updated in 6.0 again with further clarification 
[2] regarding

how a conformance implementation should handle ill-formed UTF-8 byte
sequence. Basically it says

 (1) the conversion process should not interpret any ill-formed code 
unit sequence
 (2) such process must not treat any adjacent well-formed code unit 
sequences

 as being part of those ill-formed code unit sequences
 (3) and recommend a best practice of maximal valid sub-part for 
replacement


The new UTF-8 charset implementation we put in JDK7 (and back-ported  to 
previous

release since then) follows the new constraints in most cases, except

(1) The decoder still accepts historical 3 bytes surrogates and 6 
bytes surrogate
pair (the encoder never outputs such sequence). Unicode Standard 
tightened its

UTF-8 definition in ver 3.2 [3], as

Most notable among the corrigenda to the Standard is a further 
tightening
 of the definition of UTF-8, to eliminate irregular UTF-8 and to 
bring the

 Unicode specification of UTF-8 more completely into line with other
 specifications of UTF-8.

So the 3-byte/6-byte surrogates are now defined as ill-formed code unit
sequence, instead of irregular [5] in ver 3.1

(2) While no longer accepting the historical 5-byte, 6-byte UTF-8 byte 
sequence,
the decoder treats these 5/6-byte sequence as ONE malformed unit. As a 
result

these bytes get replaced by one replacement character, when replace for
malformed is desirable (as in new String(bytes), for example). 
According the
latest Unicode standard [2], however, because the leading byte of these 
5/6-byte
sequence is no longer an illegal appearance of the UTF-8, these bytes 
should be

treated as 5-6 individual ill-formed bytes.

(3)Corner case like ill-formed byte sequence ED 31 is not handled correctly/
consistently, as described in #7082884 [6]

The reason behind (1) and (2) is mostly the compatibility concern. As 
suggested
in TR#26 [4] (in which it defines CESU-8, a separate UTF encoding scheme 
that
uses 3-6-byte sequence for supplementary characters, instead of 4-byte 
sequence
in UTF-8), there are apps/data over there that do use surrogates pair in 
UTF-8

form. To change the UTF-8 charset to follow standard obviously will break
someone's code when they migrate/upgrade from JDK/JRE N to N+1, something
we try really hard to avoid.

That said, given almost decade has passed and we are now at Unicode 6, I 
think

the possibility of breaking someone's code/date of upgrading UTF-8 to do the
right thing is small/minor. So I proposed here

(1) to upgrade the JDK8 UTF-8 implementation to strictly follow the 
standard to

 a) reject 3-byte surrogate/6-byte surrogate pair
 b) treats 5/6-byte surrogate as individual ill-formed bytes.
 c) fix the corner case bug #7082884
(2) to add CESU-8 charset into JDK/JRE's charset repository (for those still
prefer/work on 3-6 bytes surrogate, in UTF-8 form)

Here is the webrev. The change will need to go through some in-compatible
change review process, but I think we can start the code review/discussion
here first.

http://cr.openjdk.java.net/~sherman/7096080/webrev/ 
http://cr.openjdk.java.net/%7Esherman/7096080/webrev/


-Sherman

[1] http://www.unicode.org/versions/Unicode5.1.0/#Notable_Changes
[2] http://www.unicode.org/versions/Unicode6.0.0/#Conformance_Changes
[3] http://www.unicode.org/reports/tr28/tr28-3.html
[4] http://unicode.org/reports/tr26/
[5] http://unicode.org/versions/corrigendum1.html
[6] 
http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-September/007722.html




Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-09-28 Thread Ulf Zibis

Hi Sherman,

1. bug 7096080 is not visible at 
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7096080

2. bug 7096080 seems to be a duplicate of 6798514 - Charset UTF-8 accepts CESU-8 codings 
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6798514 which was closed. It should be reopened 
again.


3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results 
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537


4. Consider additionally 6184334 - Massive duplication of code in contains(Charset) methods in UTF-* 
Charsets http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6184334


5. IMHO charset CESU-8 should be hosted in extended-charsets, otherwise it should be added to 
java.nio.StandardCharsets


-Ulf



Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset

2011-09-28 Thread Xueming Shen

Hi,

On 9/28/2011 3:44 PM, Ulf Zibis wrote:

Hi Sherman,

1. bug 7096080 is not visible at 
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7096080


It might take couple days for it to show up on bugs.sun.com. But it has 
exactly the same content as

my previous email. In fact I simply copy/pasted them into email.

3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results 
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537




(1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} --- 
CoderResult.malformedForLength(1)
It appears the Unicode Standard now explicitly recommends to return the 
malformed length 2,

what UTF-8 is doing now, for this scenario

(2)  new byte[]{(byte)0xE1, (byte)0x40 --- 
CoderResult.malformedForLength(1)

The change proposed actually fixed this one already (malformed length 1

(3) new byte[]{(byte)0xC0} --- CoderResult.malformedForLength(1)
Technically this is not a bug,  the decoder will return malformedlength 
1 if you go with
decode(bf,cf, true). But yes, it would be desirable to return malformed 
length 1 without
waiting for second byte. The code/webrev has been updated to just do 
this as expected.


Now the 2-byte sequence entry check has been updated to
} else if ((b1  5) == -2  (b1  0x1e) != 0) {
...
}

and I no longer check the first byte for malformed2(),

in which I think has the smallest performance impact for 2 bytes 
sequence. I ran several
rounds of benchmark testing, I did not see significant difference. I 
will try more later.


I'm not sure I  understand the suggested  b1  -0x3e patch, I don't see 
we can simply replace

((b1  5) == -2) with (b1  -0x3e).

Anyway, I hope now you are motivated to take a deep look at the code:-) 
and maybe want to

run all your tests to confirm the change is fine.

This change does expose an existing bug/issue in StreamDecoder, in which 
the StreamDecoder fails
to replace a malformed input, in which a leading byte  is at the end 
of the stream. This is why

I commended the line in Errors. I will file a bug for this one later.

5. IMHO charset CESU-8 should be hosted in extended-charsets, 
otherwise it should be added to java.nio.StandardCharsets




We have lots of charsets provided via the standard charset provider 
(in rt.jar) but not listed in StandardCharsets.


-Sherman