Re: RFR: 8289834: Add SBCS and DBCS Only EBCDIC charsets

Ichiroh Takiguchi Fri, 26 Aug 2022 00:19:36 -0700

On Mon, 8 Aug 2022 09:22:32 GMT, Alan Bateman <al...@openjdk.org> wrote:


>> Hello @AlanBateman .
>> Sorry I'm late.
>> I got some responses from ICU. 
>> [ICU-22091](https://unicode-org.atlassian.net/browse/ICU-22091)
>> I'm not sure if they're interested in the new charset...
>> 
>> As you know, the `sun.nio.cs.ArrayDecoder` and `sun.nio.cs.ArrayEncoder` 
>> interfaces have a performance advantage.
>> The built-in charset decoders/encoders have some other performance 
>> advantages as well.
>> Is it possible to create a simple public API by using the 
>> `sun.nio.cs.SingleByte` and `sun.nio.cs.DoubleByte*` classes?
>> We'd like to use a stable conversion loop.
> 
> If they have ASCII compatible regions then that may be so, but I haven't seen 
> any performance data published on that. Do you know of any experiments that 
> have deployed a CharsetProvider for the EBCDIC charsets and compared the 
> performance with the charsets that are in the JDK? There may be merit in 
> exploring adding base abstract implementations of 
> CharsetEncoder/CharsetDecoder to java.nio.charset.spi to support single and 
> double byte charsets to see how such base implementations might look, how 
> they would help performance, and if there are any security downsides.

Hello @AlanBateman .
Sorry I'm late.
The test results are below (not guaranteed).

I wrote the small test program below; I'm not sure whether it is a good benchmark.

import java.nio.*;
import java.nio.charset.*;

public class tc {
  public static void main(String[] args) throws Exception {
    Charset cs = Charset.forName(args[0]);
    int cnt = Integer.parseInt(args[1]);
    boolean useCA = "1".equals(args[2]);
    boolean useBA = "1".equals(args[3]);
    CharsetEncoder ce = cs.newEncoder();
    byte[] ba = new byte[0x4000];
    for(int i = 0; i < ba.length; i++) {
      ba[i] = (byte) i;
    }
    String s = new String(ba, cs);
    char[] ca = s.toCharArray();
    ByteBuffer bb = useBA ? ByteBuffer.allocate(ca.length)
                          : ByteBuffer.allocateDirect(ca.length);
    CharBuffer cb = useCA ? CharBuffer.wrap(ca) : CharBuffer.wrap(s);
    System.out.println("CharBuffer.hasArray() = " + cb.hasArray());
    System.out.println("ByteBuffer.hasArray() = " + bb.hasArray());
    long start_t = System.currentTimeMillis();
    for(int i = 0; i < 200; i++) {
      ce.reset();
      bb.position(0);
      cb.position(0);
      ce.encode(cb, bb, true);
    }
    System.out.println("Warmup: "+(System.currentTimeMillis() - start_t));
    start_t = System.currentTimeMillis();
    for(int i = 0; i < cnt; i++) {
      ce.reset();
      bb.position(0);
      cb.position(0);
      ce.encode(cb, bb, true);
    }
    System.out.println("Test: "+(System.currentTimeMillis() - start_t));
  }
}
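
The program above ignores the value returned by `encode()`, so an overflow or an unmappable character would silently shorten the measured work. A minimal sketch of a sanity check, assuming it is run once per configuration before timing (the class `EncodeCheck` and the method `encodeOnce` are hypothetical names, not part of the test program):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical helper, not part of the original test program.
class EncodeCheck {
    // Runs one full encode pass and fails loudly unless all input was consumed.
    // Returns the number of bytes written to bb.
    static int encodeOnce(CharsetEncoder ce, CharBuffer cb, ByteBuffer bb) {
        ce.reset();
        bb.clear();       // position = 0, limit = capacity
        cb.position(0);
        CoderResult cr = ce.encode(cb, bb, true);
        if (!cr.isUnderflow() || cb.hasRemaining()) {
            throw new IllegalStateException("incomplete encode: " + cr);
        }
        return bb.position();
    }
}
```

If this check passes for both the OpenJDK and the ICU4J charset, the two timing loops are at least encoding the same amount of input.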


The following results apply only to my test environment:
* CPU: Intel (on-premises environment), not the same machine
* Each test was executed 5 times; the values are the averages

I ran the program with options like the following.
OpenJDK:
`java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047 20000 1 1`
ICU4J:
`java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047_P100-1995 20000 1 1`

I used jdk-20 b12.
Only the A/A case with OpenJDK uses the ArrayEncoder (ArrayDecoder) interface.
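
As background for the A/A case: as far as I understand, encoders written against the public `CharsetEncoder` API can choose an array-based loop only when both buffers report `hasArray()`. The following is a minimal sketch of that dispatch using a toy Latin-1-style encoder. `ToyLatin1Encoder` and its `lastPath` field are hypothetical names for illustration; this is not the JDK's implementation and it does not use the internal `sun.nio.cs` interfaces.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical toy encoder (chars 0x00-0xFF map to one byte each).
// It is NOT JDK code; it only illustrates the array/buffer dispatch.
class ToyLatin1Encoder extends CharsetEncoder {
    String lastPath = "none"; // records which loop ran, for demonstration

    ToyLatin1Encoder() {
        super(StandardCharsets.ISO_8859_1, 1.0f, 1.0f);
    }

    @Override
    protected CoderResult encodeLoop(CharBuffer src, ByteBuffer dst) {
        if (src.hasArray() && dst.hasArray()) {
            lastPath = "array";
            return encodeArrayLoop(src, dst);
        }
        lastPath = "buffer";
        return encodeBufferLoop(src, dst);
    }

    // Fast path: index directly into the backing arrays.
    private CoderResult encodeArrayLoop(CharBuffer src, ByteBuffer dst) {
        char[] sa = src.array();
        int sp = src.arrayOffset() + src.position();
        int sl = src.arrayOffset() + src.limit();
        byte[] da = dst.array();
        int dp = dst.arrayOffset() + dst.position();
        int dl = dst.arrayOffset() + dst.limit();
        try {
            while (sp < sl) {
                char c = sa[sp];
                if (c > 0xFF) return CoderResult.unmappableForLength(1);
                if (dp >= dl) return CoderResult.OVERFLOW;
                da[dp++] = (byte) c;
                sp++;
            }
            return CoderResult.UNDERFLOW;
        } finally {
            // Publish progress back to the buffers.
            src.position(sp - src.arrayOffset());
            dst.position(dp - dst.arrayOffset());
        }
    }

    // General path: one get()/put() call per character.
    private CoderResult encodeBufferLoop(CharBuffer src, ByteBuffer dst) {
        int mark = src.position();
        try {
            while (src.hasRemaining()) {
                char c = src.get();
                if (c > 0xFF) return CoderResult.unmappableForLength(1);
                if (!dst.hasRemaining()) return CoderResult.OVERFLOW;
                dst.put((byte) c);
                mark++;
            }
            return CoderResult.UNDERFLOW;
        } finally {
            // Leave the source positioned after the last consumed char.
            src.position(mark);
        }
    }
}
```

With this sketch, a `CharBuffer` wrapping a `char[]` plus a heap `ByteBuffer` takes the array loop, while any other combination falls back to the buffer loop.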

| | A/A | A/B | B/A | B/B |
| -- | --: | --: | --: | --: |
| Linux (OpenJDK) | 862 | 1265 | 1838 | 1843 |
| Linux (ICU4J) | 1450 | 1410 | 1152 | 1138 |
| Windows (OpenJDK) | 921 | 1231 | 1959 | 1850 |
| Windows (ICU4J) | 1431 | 1446 | 2227 | 2265 |
| Mac (OpenJDK) | 820 | 1163 | 1799 | 1774 |
| Mac (ICU4J) | 1282 | 1242 | 994 | 1049 |

Notes:
* Values are the elapsed times in milliseconds printed on the `Test:` line
* A/A means the CharBuffer is created via char[] and the ByteBuffer is created by allocate()
* A/B means the CharBuffer is created via char[] and the ByteBuffer is created by allocateDirect()
* B/A means the CharBuffer is created via String and the ByteBuffer is created by allocate()
* B/B means the CharBuffer is created via String and the ByteBuffer is created by allocateDirect()
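
The four labels correspond to what `hasArray()` reports for each buffer. A small sketch, assuming the same buffer construction as the test program (`BufferKinds` and its methods are hypothetical names used only for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

// Illustrative helper (not part of the original test program): shows which
// of the four buffer combinations are array-backed.
class BufferKinds {
    // Builds the CharBuffer the same way the test program does.
    static CharBuffer charBuffer(String s, boolean useCA) {
        return useCA ? CharBuffer.wrap(s.toCharArray()) : CharBuffer.wrap(s);
    }

    // Builds the ByteBuffer the same way the test program does.
    static ByteBuffer byteBuffer(int size, boolean useBA) {
        return useBA ? ByteBuffer.allocate(size) : ByteBuffer.allocateDirect(size);
    }

    // "A" = hasArray() is true, "B" = hasArray() is false.
    static String label(CharBuffer cb, ByteBuffer bb) {
        return (cb.hasArray() ? "A" : "B") + "/" + (bb.hasArray() ? "A" : "B");
    }

    public static void main(String[] args) {
        String s = "HELLO";
        for (boolean useCA : new boolean[] {true, false}) {
            for (boolean useBA : new boolean[] {true, false}) {
                System.out.println(label(charBuffer(s, useCA), byteBuffer(s.length(), useBA)));
            }
        }
    }
}
```

Running `main` prints the labels in the same order as the table columns: A/A, A/B, B/A, B/B.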

Actually, I'm confused by these results.
Previously, I had only compared A/A with B/B using OpenJDK's charset.
I didn't expect ICU4J's results to differ between the cases.

Anyway, please evaluate these results,
and let me know if more investigation is needed.

-------------

PR: https://git.openjdk.org/jdk/pull/9399
