Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
Hi Sherman, 2 nits again: protected static final class Encoder extends CharsetEncoder - If final, protected doesn't make sense, as it can't be subclassed. (ideally this should be a compiler error) - Where is it used? Otherwise it could be private. -Ulf Am 07.11.2011 21:30, schrieb Xueming Shen: Thanks Alan, The webrev has been updated accordingly to address your comments (the commented out isMalformed2 has been removed, copyright updated, bugid added...) And thanks for Ulf for the detailed review. -Sherman On 10/28/2011 09:14 AM, Alan Bateman wrote: On 28/09/2011 20:18, Xueming Shen wrote: Hi, [I combined the proposed charge for #7082884, in which no one appears to be interested:-) into this one] : http://cr.openjdk.java.net/~sherman/7096080/webrev/ http://cr.openjdk.java.net/%7Esherman/7096080/webrev/ I don't know if you are still looking for a reviewer for this (seems like Ulf has gone through this in detail, thanks Ulf). Overall it looks fine to me. Minor comment is that in UTF_8.java then maybe isMalformed2 should be removed completely, maybe move some of the comment in the decode methods. Another minor nits is that the date on CESU_8.java is 2000-2010 where I assume it should be 2011. In Errors.java then it might be better to just remove L196 as it might confuse future maintainers. I would also suggest adding the bugID to the list of bugs in the tests too as someone these references are useful. -Alan.
Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
Hi Ulf, Good catch. It was the leftover of my first drift, in which CESU.de/encoder is the subclass of the UTF8 de/encoder. webrev has been updated accordingly. -Sherman On 11/07/2011 12:57 PM, Ulf Zibis wrote: Hi Sherman, 2 nits again: protected static final class Encoder extends CharsetEncoder - If final, protected doesn't make sense, as it can't be subclassed. (ideally this should be a compiler error) - Where is it used? Otherwise it could be private. -Ulf Am 07.11.2011 21:30, schrieb Xueming Shen: Thanks Alan, The webrev has been updated accordingly to address your comments (the commented out isMalformed2 has been removed, copyright updated, bugid added...) And thanks for Ulf for the detailed review. -Sherman On 10/28/2011 09:14 AM, Alan Bateman wrote: On 28/09/2011 20:18, Xueming Shen wrote: Hi, [I combined the proposed charge for #7082884, in which no one appears to be interested:-) into this one] : http://cr.openjdk.java.net/~sherman/7096080/webrev/ http://cr.openjdk.java.net/%7Esherman/7096080/webrev/ I don't know if you are still looking for a reviewer for this (seems like Ulf has gone through this in detail, thanks Ulf). Overall it looks fine to me. Minor comment is that in UTF_8.java then maybe isMalformed2 should be removed completely, maybe move some of the comment in the decode methods. Another minor nits is that the date on CESU_8.java is 2000-2010 where I assume it should be 2011. In Errors.java then it might be better to just remove L196 as it might confuse future maintainers. I would also suggest adding the bugID to the list of bugs in the tests too as someone these references are useful. -Alan.
Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
On 28/09/2011 20:18, Xueming Shen wrote: Hi, [I combined the proposed charge for #7082884, in which no one appears to be interested:-) into this one] : http://cr.openjdk.java.net/~sherman/7096080/webrev/ http://cr.openjdk.java.net/%7Esherman/7096080/webrev/ I don't know if you are still looking for a reviewer for this (seems like Ulf has gone through this in detail, thanks Ulf). Overall it looks fine to me. Minor comment is that in UTF_8.java then maybe isMalformed2 should be removed completely, maybe move some of the comment in the decode methods. Another minor nits is that the date on CESU_8.java is 2000-2010 where I assume it should be 2011. In Errors.java then it might be better to just remove L196 as it might confuse future maintainers. I would also suggest adding the bugID to the list of bugs in the tests too as someone these references are useful. -Alan.
Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
Am 13.10.2011 21:13, schrieb Xueming Shen: On 10/13/2011 09:55 AM, Ulf Zibis wrote: Am 11.10.2011 19:49, schrieb Xueming Shen: I don't know which one is better, I did a run on private static boolean op1(int b) { return (b 6) != -2; } private static boolean op2(int b) { return (b 0xc0) != 0x80; } private static boolean op3(byte b) { return b = (byte)0xc0; } with 100 iteration on my linux machine, and got the scores op1=1149 op2=1147 op3=1146 I would interpret it as they are identical. Me too. thanks for your effort. Maybe the comparison would differ on different architectures. So I would prefer opt3, because the others ... 1. in question need 1 more CPU register to save the original value of b for later usage 2. need 1 more constant to load into CPU and opt 3 ... 3. is the best readable source code 4. in question seems best to help Hotspot finding best optimization on arbitrary architectures. 5. is the smallest in bytecode footprint 6. so interpreter would be faster too. I doubt it's more readable:-), given it's the byte operation means 0x80 = 0xc0 in int. If b would be an unsigned int in range [0..0xFF], half yes (it would be: b0x80 || b=0xc0). But b is in range [-0x80..0x7F] due to it's origin from a byte array, so the operation translated to int would be: b -0x80 || b = -0x40 You need b to be byte for b = (byte)0xc0 No, it works as same for int, because the lower limit -0x80 will never be exceeded and (byte)0xc0 is -0x40. So the notation b = (byte)0xc0 looks most close to its real unsigned counterpart. to be the equivalent of 0x80 = 0xc0 and all use cases in UTF-8 existing implementation the b has been stored in int already. Arguably you can update the whole implementation to achieve this, yes, that's exactly what I wanted to say. but personally I would like to just stick to the problem this proposal is trying to solve. I agree, but it's not much more than declaring the bx as byte. And, no, for the same reason I don't want to replace all (b 0xc0) != 0x80 by isNotContinuation(b), they just look fine for me, together with their neighbors, such as 0x80 = 0xc0. Yes, they look fine, but the reader always must put in mind, that (b 0xc0) != 0x80 is semantically same than isNotContinuation(b). Why you introduce isNotContinuation(b) at all? It could always be inlined, as I don't think, the tiny operation has any effect on HotSpot's optimization strategy, and as a side effect, I guess C1 code would be faster. -Ulf -Sherman Additionally I guess using always byte variables would in question help HotSpot to optimize with 1-byte-operand CPU instructions. Don't you like to replace all (bx 0xc0) != 0x80 by isNotContinuation(bx) ? -Ulf
Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
Am 30.09.2011 22:46, schrieb Xueming Shen: I believe we changed from (b1 xyz) to (b1 x) == -2 back to 2009(?) because the benchmark shows the shift version is slightly faster. Do you have any number shows any difference now. My non-scientific benchmark still suggests the shift type is faster on -server vm, no significant difference on -client vm. My new guess for the reason: The unfolding of the bytes to int to serve the isNotContinuation / isMalformedxx methods. So those methods should be coded in byte logic too. But there remains the big question, why c1 is faster than c2, except for 1b. -Ulf -- your new switch--- (1) -server Method Millis Ratio Decoding 1b UTF-8 :125 1.000 Decoding 2b UTF-8 : 2558 20.443 Decoding 3b UTF-8 : 3439 27.481 Decoding 4b UTF-8 : 2030 16.221 (2) -client Decoding 1b UTF-8 :335 1.000 Decoding 2b UTF-8 : 1041 3.105 Decoding 3b UTF-8 : 2245 6.694 Decoding 4b UTF-8 : 1254 3.741 -- existing shift--- (1) -server Decoding 1b UTF-8 :134 1.000 Decoding 2b UTF-8 : 1891 14.106 Decoding 3b UTF-8 : 2934 21.886 Decoding 4b UTF-8 : 2133 15.913 (2) -client Decoding 1b UTF-8 :341 1.000 Decoding 2b UTF-8 :949 2.560 Decoding 3b UTF-8 : 2321 6.255 Decoding 4b UTF-8 : 1278 3.446 -sherman
Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
Am 14.10.2011 10:47, schrieb Ulf Zibis: My new guess for the reason: The unfolding of the bytes to int to serve the isNotContinuation / isMalformedxx methods. So those methods should be coded in byte logic too. + use the bx = (byte)abc logic instead shift or (bx abc) != def. -Ulf
Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
Am 11.10.2011 19:49, schrieb Xueming Shen: I don't know which one is better, I did a run on private static boolean op1(int b) { return (b 6) != -2; } private static boolean op2(int b) { return (b 0xc0) != 0x80; } private static boolean op3(byte b) { return b = (byte)0xc0; } with 100 iteration on my linux machine, and got the scores op1=1149 op2=1147 op3=1146 I would interpret it as they are identical. Me too. thanks for your effort. Maybe the comparison would differ on different architectures. So I would prefer opt3, because the others ... 1. in question need 1 more CPU register to save the original value of b for later usage 2. need 1 more constant to load into CPU and opt 3 ... 3. is the best readable source code 4. in question seems best to help Hotspot finding best optimization on arbitrary architectures. Additionally I guess using always byte variables would in question help HotSpot to optimize with 1-byte-operand CPU instructions. Don't you like to replace all (bx 0xc0) != 0x80 by isNotContinuation(bx) ? -Ulf
Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
On 10/13/2011 09:55 AM, Ulf Zibis wrote: Am 11.10.2011 19:49, schrieb Xueming Shen: I don't know which one is better, I did a run on private static boolean op1(int b) { return (b 6) != -2; } private static boolean op2(int b) { return (b 0xc0) != 0x80; } private static boolean op3(byte b) { return b = (byte)0xc0; } with 100 iteration on my linux machine, and got the scores op1=1149 op2=1147 op3=1146 I would interpret it as they are identical. Me too. thanks for your effort. Maybe the comparison would differ on different architectures. So I would prefer opt3, because the others ... 1. in question need 1 more CPU register to save the original value of b for later usage 2. need 1 more constant to load into CPU and opt 3 ... 3. is the best readable source code 4. in question seems best to help Hotspot finding best optimization on arbitrary architectures. I doubt it's more readable:-), given it's the byte operation means 0x80 = 0xc0 in int. You need b to be byte for b = (byte)0xc0 to be the equivalent of 0x80 = 0xc0 and all use cases in UTF-8 existing implementation the b has been stored in int already. Arguably you can update the whole implementation to achieve this, but personally I would like to just stick to the problem this proposal is trying to solve. And, no, for the same reason I don't want to replace all (b 0xc0) != 0x80 by isNotContinuation(b), they just look fine for me, together with their neighbors, such as 0x80 = 0xc0. -Sherman Additionally I guess using always byte variables would in question help HotSpot to optimize with 1-byte-operand CPU instructions. Don't you like to replace all (bx 0xc0) != 0x80 by isNotContinuation(bx) ? -Ulf
Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
Hi Sherman, I didn't read anything from you since longer time. You are in holidays? Am 30.09.2011 22:46, schrieb Xueming Shen: I believe we changed from (b1 xyz) to (b1 x) == -2 back to 2009(?) because the benchmark shows the shift version is slightly faster. Do you have any number shows any difference now. My non-scientific benchmark still suggests the shift type is faster on -server vm, no significant difference on -client vm. In this sense, then you should do the same here: 87 private static boolean isNotContinuation(int b) { 88 return (b 6) != -2; 89 } ... + in all isMalformedxxx(). BTW, in all isMalformedxxx() you could replace all (bx 0xc0) != 0x80 by isNotContinuation(bx) (would reduce the effort, checking the hex values manually, each time while reading) Additionally: Make private: 75 private static final void updatePositions( 76 Buffer src, int sp, Buffer dst, int dp) { -Ulf
Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
Am 11.10.2011 13:36, schrieb Ulf Zibis: I believe we changed from (b1 xyz) to (b1 x) == -2 back to 2009(?) because the benchmark shows the shift version is slightly faster. Do you have any number shows any difference now. My non-scientific benchmark still suggests the shift type is faster on -server vm, no significant difference on -client vm. In this sense, then you should do the same here: 87 private static boolean isNotContinuation(int b) { 88 return (b 6) != -2; 89 } Another 3rd variant to compare for performance: (needs only 1 constant to load in cpu register, and is a 1-byte op code) 87 private static boolean isNotContinuation(byte b) { 88 return b = (byte)0xc0; 89 } -Ulf
Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
On 10/11/2011 04:36 AM, Ulf Zibis wrote: Am 30.09.2011 22:46, schrieb Xueming Shen: I believe we changed from (b1 xyz) to (b1 x) == -2 back to 2009(?) because the benchmark shows the shift version is slightly faster. Do you have any number shows any difference now. My non-scientific benchmark still suggests the shift type is faster on -server vm, no significant difference on -client vm. In this sense, then you should do the same here: 87 private static boolean isNotContinuation(int b) { 88 return (b 6) != -2; 89 } I don't know which one is better, I did a run on private static boolean op1(int b) { return (b 6) != -2; } private static boolean op2(int b) { return (b 0xc0) != 0x80; } private static boolean op3(byte b) { return b = (byte)0xc0; } with 100 iteration on my linux machine, and got the scores op1=1149 op2=1147 op3=1146 I would interpret it as they are identical. Additionally: Make private: 75 private static final void updatePositions( 76 Buffer src, int sp, Buffer dst, int dp) { updated accordingly -Sherman
Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf Go to 3.9 Unicode Encoding Forms. Or simply search D93 On 10/1/2011 2:21 PM, Ulf Zibis wrote: Am 30.09.2011 22:46, schrieb Xueming Shen: On 09/30/2011 07:09 AM, Ulf Zibis wrote: (1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} --- CoderResult.malformedForLength(1) It appears the Unicode Standard now explicitly recommends to return the malformed length 2, what UTF-8 is doing now, for this scenario My idea behind was, that in case of malformed length 1 a consecutive call to the decode loop would again return another malformed length 1, to ensure 2 replacement chars in the output string. (Not sure, if that is expected in this corner case.) Unicode Standard's best practices D93a/b recommends to return 2 in this case. Can you please give me a link for D93a/a. I don't know, where to find it. 3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537 I'm not sure I understand the suggested b1 -0x3e patch, I don't see we can simply replace ((b1 5) == -2) with (b1 -0x3e). You must see the b1 -0x3e in combination with the following b1 -0x20. ;-) But now I have a better if...else if switch. :-) - saves the shift operations - only 1 comparison per case - only 1 constant to load per case - helps compiler to benefit from 1 byte constants and op-codes - much better readable I believe we changed from (b1 xyz) to (b1 x) == -2 back to 2009(?) because the benchmark shows the shift version is slightly faster. IIRC this was only about a shift by multiples of 8 to ensure an 1-byte comparison of 16/32-byte values in the double/quad-byte charsets. Do you have any number shows any difference now. My non-scientific benchmark still suggests the shift type is faster on -server vm, no significant difference on -client vm. -- your new switch--- (1) -server Method Millis Ratio Decoding 1b UTF-8 :125 1.000 Decoding 2b UTF-8 : 2558 20.443 Decoding 3b UTF-8 : 3439 27.481 Decoding 4b UTF-8 : 2030 16.221 (2) -client Decoding 1b UTF-8 :335 1.000 Decoding 2b UTF-8 : 1041 3.105 Decoding 3b UTF-8 : 2245 6.694 Decoding 4b UTF-8 : 1254 3.741 -- existing shift--- (1) -server Decoding 1b UTF-8 :134 1.000 Decoding 2b UTF-8 : 1891 14.106 Decoding 3b UTF-8 : 2934 21.886 Decoding 4b UTF-8 : 2133 15.913 (2) -client Decoding 1b UTF-8 :341 1.000 Decoding 2b UTF-8 :949 2.560 Decoding 3b UTF-8 : 2321 6.255 Decoding 4b UTF-8 : 1278 3.446 Very interesting and surprising numbers! The most surprising is, that the client compiler generates faster code for 2..4-byte codes. I think, we should ask the HotSpot team for help. As the UTF-8 de/encoding is a very frequent task, HotSpot should provide compiled code as optimized best as possible for UTF-8 de/encoding. Another surprise is, that benchmark for 1b UTF-8 is not same for new switch and shift version, as the ASCII only loop is the same in both versions. To discover the miracle, why theshift version is little faster than the new switch version, it should be helpful, to see the disassembling of the HotSpot compiled code. A third version, using the (b1 0xe0) == 0xc0/(b1 0xf0) == 0xe0/(b1 0xf8) == 0xf0 pattern, should be interesting toofor the benchmark comparison. In my opinion it would be more significant to compare x 1..4-byte codes than y bytes of 1..4-byte codes. I.e. 1000 bytes of 1-byte codes against 2000 bytes of 2-byte codes against 3000 bytes of 3-byte codes against 4000 bytes of 4-byte codes We should document somewhere, that the ESU-8 decoder is faster than the strong UTF-8 decoder for developers, who can ensure, that there are no invalid surrogates in their source bytes. -Ulf
Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
Am 02.10.2011 08:29, schrieb Xueming Shen: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf Go to 3.9 Unicode Encoding Forms. Or simply search D93 On 10/1/2011 2:21 PM, Ulf Zibis wrote: Am 30.09.2011 22:46, schrieb Xueming Shen: On 09/30/2011 07:09 AM, Ulf Zibis wrote: (1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} --- CoderResult.malformedForLength(1) It appears the Unicode Standard now explicitly recommends to return the malformed length 2, what UTF-8 is doing now, for this scenario My idea behind was, that in case of malformed length 1 a consecutive call to the decode loop would again return another malformed length 1, to ensure 2 replacement chars in the output string. (Not sure, if that is expected in this corner case.) Unicode Standard's best practices D93a/b recommends to return 2 in this case. OK, I got it: E1 80 42 -- malformed length 2 -- 1 replacement -- FFFD 0042 Because for later understanding by others it could be difficult to find the right documents, it would be *very nice* to add this link to the souce code of UTF_8.java, by javadoc, or by simple doc. -Ulf
Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
Hi again, Am 30.09.2011 00:27, schrieb Xueming Shen: On 09/29/2011 02:16 PM, Ulf Zibis wrote: 280 if (Character.isSurrogate(c)) 281 return malformedForLength(src, sp, dst, dp, 3); Shouldn't we return cr.length() = 1to allow remaining 2 bytes to be interpreted again ? Forget it! If c is a surrogate, b2 is in range A0..BF and b3 is in range 80..BF. Both can not be potentially well-formed as a first byte. Actually I don't know the answer. My reading of D93a/D93b suggests that we might interpret it as a whole, given the bytes are actually in well-formed byte pattern range listed in Table 3.7, but ill-formed simply because they are surrogate value not scale value, so I would interpret the whole 3 bytes as a maximal subpart. Given D93a/b is best practices for Using U+fffd, either way should be fine. We do have Unicode expert on the list, so maybe they can share their opinion on what is the desired/recommended behavior in this case, from Standard point view? At line 102 you could insert: // [E0] [A0..BF] // [E1..EF] [80..BF] -Ulf
Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
Am 30.09.2011 22:46, schrieb Xueming Shen: On 09/30/2011 07:09 AM, Ulf Zibis wrote: (1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} --- CoderResult.malformedForLength(1) It appears the Unicode Standard now explicitly recommends to return the malformed length 2, what UTF-8 is doing now, for this scenario My idea behind was, that in case of malformed length 1 a consecutive call to the decode loop would again return another malformed length 1, to ensure 2 replacement chars in the output string. (Not sure, if that is expected in this corner case.) Unicode Standard's best practices D93a/b recommends to return 2 in this case. Can you please give me a link for D93a/a. I don't know, where to find it. 3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537 I'm not sure I understand the suggested b1 -0x3e patch, I don't see we can simply replace ((b1 5) == -2) with (b1 -0x3e). You must see the b1 -0x3e in combination with the following b1 -0x20. ;-) But now I have a better if...else if switch. :-) - saves the shift operations - only 1 comparison per case - only 1 constant to load per case - helps compiler to benefit from 1 byte constants and op-codes - much better readable I believe we changed from (b1 xyz) to (b1 x) == -2 back to 2009(?) because the benchmark shows the shift version is slightly faster. IIRC this was only about a shift by multiples of 8 to ensure an 1-byte comparison of 16/32-byte values in the double/quad-byte charsets. Do you have any number shows any difference now. My non-scientific benchmark still suggests the shift type is faster on -server vm, no significant difference on -client vm. -- your new switch--- (1) -server Method Millis Ratio Decoding 1b UTF-8 :125 1.000 Decoding 2b UTF-8 : 2558 20.443 Decoding 3b UTF-8 : 3439 27.481 Decoding 4b UTF-8 : 2030 16.221 (2) -client Decoding 1b UTF-8 :335 1.000 Decoding 2b UTF-8 : 1041 3.105 Decoding 3b UTF-8 : 2245 6.694 Decoding 4b UTF-8 : 1254 3.741 -- existing shift--- (1) -server Decoding 1b UTF-8 :134 1.000 Decoding 2b UTF-8 : 1891 14.106 Decoding 3b UTF-8 : 2934 21.886 Decoding 4b UTF-8 : 2133 15.913 (2) -client Decoding 1b UTF-8 :341 1.000 Decoding 2b UTF-8 :949 2.560 Decoding 3b UTF-8 : 2321 6.255 Decoding 4b UTF-8 : 1278 3.446 Very interesting and surprising numbers! The most surprising is, that the client compiler generates faster code for 2..4-byte codes. I think, we should ask the HotSpot team for help. As the UTF-8 de/encoding is a very frequent task, HotSpot should provide compiled code as optimized best as possible for UTF-8 de/encoding. Another surprise is, that benchmark for 1b UTF-8 is not same for new switch and shift version, as the ASCII only loop is the same in both versions. To discover the miracle, why theshift version is little faster than the new switch version, it should be helpful, to see the disassembling of the HotSpot compiled code. A third version, using the (b1 0xe0) == 0xc0/(b1 0xf0) == 0xe0/(b1 0xf8) == 0xf0 pattern, should be interesting toofor the benchmark comparison. In my opinion it would be more significant to compare x 1..4-byte codes than y bytes of 1..4-byte codes. I.e. 1000 bytes of 1-byte codes against 2000 bytes of 2-byte codes against 3000 bytes of 3-byte codes against 4000 bytes of 4-byte codes We should document somewhere, that the ESU-8 decoder is faster than the strong UTF-8 decoder for developers, who can ensure, that there are no invalid surrogates in their source bytes. -Ulf
Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
Hi, Am 29.09.2011 05:27, schrieb Xueming Shen: Hi, On 9/28/2011 3:44 PM, Ulf Zibis wrote 3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537 (1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} --- CoderResult.malformedForLength(1) It appears the Unicode Standard now explicitly recommends to return the malformed length 2, what UTF-8 is doing now, for this scenario My idea behind is, that in case of malformed length 1 a consecutive call to the decode loop would again return another malformed length 1, to ensure 2 replacement chars in the output string. (Not sure, if that is expected in this corner case.) I'm not sure I understand the suggested b1 -0x3e patch, I don't see we can simply replace ((b1 5) == -2) with (b1 -0x3e). You must see the b1 -0x3e in combination with the following b1 -0x20. ;-) But now I have a better if...else if switch. :-) - saves the shift operations - only 1 comparison per case - only 1 constant to load per case - helps compiler to benefit from 1 byte constants and op-codes - much better readable byte b1 = sa[sp]; // help compiler to benefit from 1 byte op-codes and constants //byte b1 = src.get();// help compiler to benefit from 1 byte op-codes and constants Byte1Switch:if (b1 = 0) { // 1 byte, 7 bits: 0xxx ...return x; } else if (b1 (byte)0xe0) { // 2 bytes, 11 bits: 110x 10xx if (b1 (byte)0xc2) // b1 C2 not legal break Byte1Switch; ...return x; } else if (b1 (byte)0xf0) { // 3 bytes, 16 bits: 1110 10xx 10xx ...return x; } else if (b1 (byte)0xf8) { // 4 bytes, 21 bits: 0xxx 10xx 10xx 10xx ...return x; } return malformed(src, sp, dst, dp, 1); //return malformed(src, mark, 1); Anyway, I hope now you are motivated to take a deep look at the code:-) and maybe want to run all your tests to confirm the change is fine. At the moment I don't have a well running system, so my contribution must remain limited. :-( About motivation: For me it's kinda frustrating, seeing a bug from external voluntary contributor as Will Not Fix, but some time later an Oracle employée creates an equivalent new bug from scratch without at least referring to the existing ones, e.g.: 6798514, 6795537. Additionally I think, now it's the right time to re-evaluate bug 4508058 - UTF-8 encoding does not recognize initial BOM. Cheers, -Ulf
Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
On 09/30/2011 07:09 AM, Ulf Zibis wrote: (1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} --- CoderResult.malformedForLength(1) It appears the Unicode Standard now explicitly recommends to return the malformed length 2, what UTF-8 is doing now, for this scenario My idea behind is, that in case of malformed length 1 a consecutive call to the decode loop would again return another malformed length 1, to ensure 2 replacement chars in the output string. (Not sure, if that is expected in this corner case.) Unicode Standard's best practices D93a/b recommends to return 2 in this case. 3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537 I'm not sure I understand the suggested b1 -0x3e patch, I don't see we can simply replace ((b1 5) == -2) with (b1 -0x3e). You must see the b1 -0x3e in combination with the following b1 -0x20. ;-) But now I have a better if...else if switch. :-) - saves the shift operations - only 1 comparison per case - only 1 constant to load per case - helps compiler to benefit from 1 byte constants and op-codes - much better readable I believe we changed from (b1 xyz) to (b1 x) == -2 back to 2009(?) because the benchmark shows the shift version is slightly faster. Do you have any number shows any difference now. My non-scientific benchmark still suggests the shift type is faster on -server vm, no significant difference on -client vm. -- your new switch--- (1) -server Method Millis Ratio Decoding 1b UTF-8 :125 1.000 Decoding 2b UTF-8 : 2558 20.443 Decoding 3b UTF-8 : 3439 27.481 Decoding 4b UTF-8 : 2030 16.221 (2) -client Decoding 1b UTF-8 :335 1.000 Decoding 2b UTF-8 : 1041 3.105 Decoding 3b UTF-8 : 2245 6.694 Decoding 4b UTF-8 : 1254 3.741 -- existing shift--- (1) -server Decoding 1b UTF-8 :134 1.000 Decoding 2b UTF-8 : 1891 14.106 Decoding 3b UTF-8 : 2934 21.886 Decoding 4b UTF-8 : 2133 15.913 (2) -client Decoding 1b UTF-8 :341 1.000 Decoding 2b UTF-8 :949 2.560 Decoding 3b UTF-8 : 2321 6.255 Decoding 4b UTF-8 : 1278 3.446 -sherman
Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
On 09/29/2011 02:16 PM, Ulf Zibis wrote: Please use spaces with ternary operators: Lines 155, 216 For short you could use sr instead srcRemaining, consistent to sa, sp, sl. 420 // returns -1 if there is malformed byte(s) and the better: 420 // returns -1 if there is/are malformed byte(s) and the 466 sp -=3; There should be a space: sp -= 3; Webrev has been updated accordingly. 280 if (Character.isSurrogate(c)) 281 return malformedForLength(src, sp, dst, dp, 3); Shouldn't we return cr.length() = 1to allow remaining 2 bytes to be interpreted again ? Actually I don't know the answer. My reading of D93a/D93b suggests that we might interpret it as a whole, given the bytes are actually in well-formed byte pattern range listed in Table 3.7, but ill-formed simply because they are surrogate value not scale value, so I would interpret the whole 3 bytes as a maximal subpart. Given D93a/b is best practices for Using U+fffd, either way should be fine. We do have Unicode expert on the list, so maybe they can share their opinion on what is the desired/recommended behavior in this case, from Standard point view? Am 29.09.2011 05:27, schrieb Xueming Shen: Hi, On 9/28/2011 3:44 PM, Ulf Zibis wrote: 5. IMHO charset CESU-8 should be hosted in extended-charsets, otherwise it should be added to java.nio.StandardCharsets We have lots of charsets provided via the standard charset provider (in rt.jar) but not listed in StandardCharsets. Yes, but the reasonable to add CESU-8 to StandardCharsets was the supposed demand to treat all unicode charsets equivalent. Otherwise there is no obstacle to host CESU-8 in extended-charsets. IMHO, CESU-8 addresses corner case compatibility issues, but not standard requirements. To put CESU-8 into standard charset provider (it is only an implementation details) does not mean it is a standard requirement, it just means it is bundled into rt.jar. The reason I put it there is to make sure it is together with the UTF-8, with the assumption is that you might need it around when using the updated UTF-8, which no longer handles those 3/6-byte surrogates. -Sherman
Codereview request for 7096080: UTF8 update and new CESU-8 charset
Hi, [I combined the proposed charge for #7082884, in which no one appears to be interested:-) into this one] Unicode Standard added Addition Constraints on conversion of ill-formed UTF-8 in version 5.1 [1] and updated in 6.0 again with further clarification [2] regarding how a conformance implementation should handle ill-formed UTF-8 byte sequence. Basically it says (1) the conversion process should not interpret any ill-formed code unit sequence (2) such process must not treat any adjacent well-formed code unit sequences as being part of those ill-formed code unit sequences (3) and recommend a best practice of maximal valid sub-part for replacement The new UTF-8 charset implementation we put in JDK7 (and back-ported to previous release since then) follows the new constraints in most cases, except (1) The decoder still accepts historical 3 bytes surrogates and 6 bytes surrogate pair (the encoder never outputs such sequence). Unicode Standard tightened its UTF-8 definition in ver 3.2 [3], as Most notable among the corrigenda to the Standard is a further tightening of the definition of UTF-8, to eliminate irregular UTF-8 and to bring the Unicode specification of UTF-8 more completely into line with other specifications of UTF-8. So the 3-byte/6-byte surrogates are now defined as ill-formed code unit sequence, instead of irregular [5] in ver 3.1 (2) While no longer accepting the historical 5-byte, 6-byte UTF-8 byte sequence, the decoder treats these 5/6-byte sequence as ONE malformed unit. As a result these bytes get replaced by one replacement character, when replace for malformed is desirable (as in new String(bytes), for example). According the latest Unicode standard [2], however, because the leading byte of these 5/6-byte sequence is no longer an illegal appearance of the UTF-8, these bytes should be treated as 5-6 individual ill-formed bytes. (3)Corner case like ill-formed byte sequence ED 31 is not handled correctly/ consistently, as described in #7082884 [6] The reason behind (1) and (2) is mostly the compatibility concern. As suggested in TR#26 [4] (in which it defines CESU-8, a separate UTF encoding scheme that uses 3-6-byte sequence for supplementary characters, instead of 4-byte sequence in UTF-8), there are apps/data over there that do use surrogates pair in UTF-8 form. To change the UTF-8 charset to follow standard obviously will break someone's code when they migrate/upgrade from JDK/JRE N to N+1, something we try really hard to avoid. That said, given almost decade has passed and we are now at Unicode 6, I think the possibility of breaking someone's code/date of upgrading UTF-8 to do the right thing is small/minor. So I proposed here (1) to upgrade the JDK8 UTF-8 implementation to strictly follow the standard to a) reject 3-byte surrogate/6-byte surrogate pair b) treats 5/6-byte surrogate as individual ill-formed bytes. c) fix the corner case bug #7082884 (2) to add CESU-8 charset into JDK/JRE's charset repository (for those still prefer/work on 3-6 bytes surrogate, in UTF-8 form) Here is the webrev. The change will need to go through some in-compatible change review process, but I think we can start the code review/discussion here first. http://cr.openjdk.java.net/~sherman/7096080/webrev/ http://cr.openjdk.java.net/%7Esherman/7096080/webrev/ -Sherman [1] http://www.unicode.org/versions/Unicode5.1.0/#Notable_Changes [2] http://www.unicode.org/versions/Unicode6.0.0/#Conformance_Changes [3] http://www.unicode.org/reports/tr28/tr28-3.html [4] http://unicode.org/reports/tr26/ [5] http://unicode.org/versions/corrigendum1.html [6] http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-September/007722.html
Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
Hi Sherman, 1. bug 7096080 is not visible at http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7096080 2. bug 7096080 seems to be a duplicate of 6798514 - Charset UTF-8 accepts CESU-8 codings http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6798514 which was closed. It should be reopened again. 3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537 4. Consider additionally 6184334 - Massive duplication of code in contains(Charset) methods in UTF-* Charsets http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6184334 5. IMHO charset CESU-8 should be hosted in extended-charsets, otherwise it should be added to java.nio.StandardCharsets -Ulf
Re: Codereview request for 7096080: UTF8 update and new CESU-8 charset
Hi, On 9/28/2011 3:44 PM, Ulf Zibis wrote: Hi Sherman, 1. bug 7096080 is not visible at http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7096080 It might take couple days for it to show up on bugs.sun.com. But it has exactly the same content as my previous email. In fact I simply copy/pasted them into email. 3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537 (1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} --- CoderResult.malformedForLength(1) It appears the Unicode Standard now explicitly recommends to return the malformed length 2, what UTF-8 is doing now, for this scenario (2) new byte[]{(byte)0xE1, (byte)0x40 --- CoderResult.malformedForLength(1) The change proposed actually fixed this one already (malformed length 1 (3) new byte[]{(byte)0xC0} --- CoderResult.malformedForLength(1) Technically this is not a bug, the decoder will return malformedlength 1 if you go with decode(bf,cf, true). But yes, it would be desirable to return malformed length 1 without waiting for second byte. The code/webrev has been updated to just do this as expected. Now the 2-byte sequence entry check has been updated to } else if ((b1 5) == -2 (b1 0x1e) != 0) { ... } and I no longer check the first byte for malformed2(), in which I think has the smallest performance impact for 2 bytes sequence. I ran several rounds of benchmark testing, I did not see significant difference. I will try more later. I'm not sure I understand the suggested b1 -0x3e patch, I don't see we can simply replace ((b1 5) == -2) with (b1 -0x3e). Anyway, I hope now you are motivated to take a deep look at the code:-) and maybe want to run all your tests to confirm the change is fine. This change does expose an existing bug/issue in StreamDecoder, in which the StreamDecoder fails to replace a malformed input, in which a leading byte is at the end of the stream. This is why I commended the line in Errors. I will file a bug for this one later. 5. IMHO charset CESU-8 should be hosted in extended-charsets, otherwise it should be added to java.nio.StandardCharsets We have lots of charsets provided via the standard charset provider (in rt.jar) but not listed in StandardCharsets. -Sherman