Hi,

Please help review the changes for j.u.z.ZipCoder/JDK-8184947, which also include
cleanup/improvement work in java.lang.StringCoding to speed up general String
coding performance, especially for UTF8.

issue: https://bugs.openjdk.java.net/browse/JDK-8184947
webrev: http://cr.openjdk.java.net/~sherman/8184947/webrev

jmh benchmark:
http://cr.openjdk.java.net/~sherman/8184947/ZipCodingBM.java
http://cr.openjdk.java.net/~sherman/8184947/StringCodingBM.java

Notes:

(1) StringCoding.de/encode() for new String()/String.getBytes() with default charset.

For historical reasons, the existing SC.decode(byte[], off, len)/encode(coder, val)
implementation has code to handle any "possible" UnsupportedEncodingException
situation and turns to the slow "charset name" version of de/encode() for the real
work. Given that Charset.defaultCharset() now returns UTF8 as the fallback default
charset if anything goes wrong while obtaining the default charset (we did that in
jdk7 or 8?), there is actually no need to handle the UEE. This also provides the
opportunity to use the fast path for stateless UTF8/8859-1/ASCII de/encode(). The
benchmark data for newString_xxx/getBytes_xxx (which use the default encoding,
UTF8 in this case) suggest a big speedup for ascii-only Strings.

StringCodingBM      (size)  Mode  Cnt  NEW Score     Error  OLD Score     Error  Units

getBytes_ASCII          16  avgt    5     21.155 ±   5.586     63.777 ±  54.262  ns/op
getBytes_ASCII          64  avgt    5     20.854 ±   6.237     98.988 ±  62.932  ns/op
getBytes_ASCII         256  avgt    5     38.291 ±   8.494    272.306 ±  77.951  ns/op
getBytes_Latin          16  avgt    5     80.968 ±  15.814     76.769 ±  38.512  ns/op
getBytes_Latin          64  avgt    5    163.078 ±  51.993    219.085 ±  42.665  ns/op
getBytes_Latin         256  avgt    5    759.548 ±  99.386    824.594 ± 763.735  ns/op
getBytes_Unicode        16  avgt    5     94.311 ±  22.189    124.185 ±  32.751  ns/op
getBytes_Unicode        64  avgt    5    289.603 ± 152.056    321.541 ± 103.703  ns/op
getBytes_Unicode       256  avgt    5   1253.098 ± 216.243   1201.667 ± 512.532  ns/op

newString_ASCII         16  avgt    5     33.273 ±  13.780     50.402 ±  17.574  ns/op
newString_ASCII         64  avgt    5     30.420 ±   6.207     84.989 ±  43.355  ns/op
newString_ASCII        256  avgt    5     54.391 ±  10.451    208.096 ± 102.716  ns/op
newString_Latin         16  avgt    5    115.606 ±   7.181    114.186 ±  36.310  ns/op
newString_Latin         64  avgt    5    393.710 ±  73.478    414.286 ± 176.837  ns/op
newString_Latin        256  avgt    5   1618.967 ± 289.044   1551.499 ± 487.904  ns/op
newString_Unicode       16  avgt    5    104.848 ±  32.694    127.558 ±  12.029  ns/op
newString_Unicode       64  avgt    5    377.894 ± 147.731    374.779 ±  53.028  ns/op
newString_Unicode      256  avgt    5   1557.977 ± 318.652   1457.236 ± 284.424  ns/op
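
To make the UEE point in (1) concrete, here is a minimal sketch using only the public String API (it is not the StringCoding code in the webrev, and the class and method names are made up for illustration). The charset-name overload forces a checked UnsupportedEncodingException on the caller, while the default-charset overloads have nothing to catch because Charset.defaultCharset() always returns a usable Charset.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical illustration class; not part of the patch.
public class DefaultCharsetSketch {

    // "charset name" variant: the name is looked up at call time, so the
    // caller has to deal with UnsupportedEncodingException.
    static byte[] encodeByName(String s, String charsetName) {
        try {
            return s.getBytes(charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // default-charset variants: no checked exception to handle.
    static byte[] encodeDefault(String s) {
        return s.getBytes();
    }

    static String decodeDefault(byte[] bytes) {
        return new String(bytes);
    }

    public static void main(String[] args) {
        Charset cs = Charset.defaultCharset();
        byte[] a = encodeDefault("hello");
        byte[] b = encodeByName("hello", cs.name());
        System.out.println(cs + " same bytes: " + Arrays.equals(a, b));
        System.out.println(decodeDefault(a));
    }
}
```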


(2) updated to use the "fast path" for UTF8/8859-1/ASCII in all de/encoding operations,
all of which are implemented as static/stateless methods. (A benchmark for MS932 [4] is
provided to make sure there is no regression for "other" charsets.)
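
As a rough illustration of the dispatch idea in (2), the sketch below routes two well-known charsets to a static, stateless loop and leaves everything else (MS932 and friends) on the generic path. The class and method names are hypothetical, the real StringCoding code works on the String's internal byte[] rather than charAt(), and the '?' replacement simply mirrors what String.getBytes(Charset) does for unmappable characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not the StringCoding implementation.
public class FastPathDispatchSketch {

    static byte[] encode(Charset cs, String s) {
        // Identity checks against the well-known charset singletons; anything
        // that does not match falls through to the generic encoder.
        if (cs == StandardCharsets.ISO_8859_1) return encodeSingleByte(s, 0xFF);
        if (cs == StandardCharsets.US_ASCII)   return encodeSingleByte(s, 0x7F);
        return s.getBytes(cs);
    }

    // Stateless single-byte encoder: chars up to maxChar map 1:1 to bytes,
    // anything else is replaced with '?'.
    static byte[] encodeSingleByte(String s, int maxChar) {
        int len = s.length();
        byte[] dst = new byte[len];
        int dp = 0;
        for (int sp = 0; sp < len; sp++) {
            char c = s.charAt(sp);
            if (c <= maxChar) {
                dst[dp++] = (byte) c;
                continue;
            }
            // A surrogate pair is one unmappable character, so emit one '?'.
            if (Character.isHighSurrogate(c) && sp + 1 < len
                    && Character.isLowSurrogate(s.charAt(sp + 1))) {
                sp++;
            }
            dst[dp++] = '?';
        }
        return dp == len ? dst : Arrays.copyOf(dst, dp);
    }
}
```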

(3) added a "fast path" for "ascii-only" bytes in utf8 encoding/getBytes(). The benchmark [1]
suggests a big speedup for ascii-only getBytes() with limited cost to the non-ascii-only
cases. (This helps a lot for (4), the ZipCoder situation, which is mostly ascii-only.)
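
The ascii-only fast path in (3) can be sketched roughly as follows. This is a simplified illustration with made-up names, written against charAt() instead of the String's internal byte[], so it only shows the shape of the idea: try the 1:1 copy first and fall back to the general UTF8 encoder at the first non-ascii char.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not the code in the webrev.
public class AsciiFastPathSketch {

    static byte[] encodeUtf8(String s) {
        int len = s.length();
        byte[] dst = new byte[len];
        int i = 0;
        // Fast path: every ascii char is exactly one UTF8 byte.
        for (; i < len; i++) {
            char c = s.charAt(i);
            if (c >= 0x80) {
                break;
            }
            dst[i] = (byte) c;
        }
        if (i == len) {
            return dst;                    // whole string was ascii
        }
        // Slow path: the cost for non-ascii input is the wasted scan above.
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encodeUtf8("META-INF/MANIFEST.MF").length);
        System.out.println(encodeUtf8("caf\u00e9").length);
    }
}
```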

(4) java.util.zip.ZipCoder

This is where this patch actually started from. As the rfe suggests, now that we are
using byte[] as the internal storage for the String class, the optimization we put in
ZipCoder for UTF8 (which uses the byte[]/char[] interface of our UTF8 implementation to
help avoid the relatively heavy ByteBuffer/CharBuffer coding interface) no longer appears
to be that "optimized". The to/from char[] copying has become a waste.

The ZipCoder implementation can't use new String()/String.getBytes() directly because of
the different malformed/unmappable character handling requirement. The proposed change
here is to add a pair of special new String()/String.getBytes() methods in the
StringCoding class that throw IAE instead of doing silent replacement, exposed via (yet
another) SharedSecrets interface. This brings us much faster de/encoding (30%-50%
speedup) and much less memory usage (no more unnecessary byte[]/char[] allocation, and in
default mode there is only ONE utf8 ZipCoder) on all "Jar/ZipEntry" related access
operations.
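
For readers who want to see the difference in error handling, here is a small sketch of the "throw IAE instead of silent replacement" semantics using only the public java.nio.charset API. Note that it goes through the ByteBuffer/CharBuffer coder interface, which is exactly the heavier path the patch tries to avoid, so it illustrates the behavior only; the class and method names are hypothetical and are not the SharedSecrets entry points.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not the API added by the patch.
public class StrictUtf8Sketch {

    // Decode, reporting malformed input instead of substituting U+FFFD.
    static String decodeStrict(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("malformed input", e);
        }
    }

    // Encode, reporting e.g. unpaired surrogates instead of writing '?'.
    static byte[] encodeStrict(String s) {
        try {
            ByteBuffer bb = StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("unmappable input", e);
        }
    }

    public static void main(String[] args) {
        byte[] bad = { 'a', (byte) 0xFF, 'b' };
        // Lenient: new String() silently replaces the bad byte.
        System.out.println(new String(bad, StandardCharsets.UTF_8));
        // Strict: the same bytes are rejected.
        try {
            decodeStrict(bad);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```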

ZipCodeBenchMark [latest]
    * "New Score" is with the patch
    * getEntry() is mainly String.getBytes(); entries()/stream() is mainly new String(bytes).

               Mode  Cnt  New Score    Error  Old Score    Error  Units
jf_entries     avgt   20      0.582 ±  0.036      0.953 ±  0.108  ms/op
jf_getEntry    avgt   20      1.506 ±  0.158      2.052 ±  0.171  ms/op
jf_stream      avgt   20      0.698 ±  0.060      0.940 ±  0.067  ms/op
zf_entries     avgt   20      0.691 ±  0.057      0.917 ±  0.080  ms/op
zf_getEntry    avgt   20      1.459 ±  0.180      2.081 ±  0.161  ms/op
zf_stream      avgt   20      0.626 ±  0.074      0.909 ±  0.075  ms/op



Thanks,
Sherman

[1] http://cr.openjdk.java.net/~sherman/8184947/StringCoding.utf8
[2] http://cr.openjdk.java.net/~sherman/8184947/StringCoding.8859_1
[3] http://cr.openjdk.java.net/~sherman/8184947/StringCoding.ascii
[4] http://cr.openjdk.java.net/~sherman/8184947/StringCoding.ms932
[5] http://cr.openjdk.java.net/~sherman/8184947/ZipCoding.bm



