Hello,

I am looking at the following bug:

https://bugs.openjdk.java.net/browse/JDK-8230531

and hoping someone who is familiar with the encoder can clear things up. As noted in the bug report, the method's description reads:

--
Returns the maximum number of bytes that will be produced for each character of input. This value may be used to compute the worst-case size of the output buffer required for a given input sequence.
--

Initially I thought it would return the maximum number of encoded bytes for an arbitrary input "char" value, i.e., a code unit of the UTF-16 encoding. For example, any of the UTF-16 charsets (UTF-16, UTF-16BE, and UTF-16LE) would return 2 from the method, as the code unit is a 16-bit value. In reality, the encoder for the UTF-16 charset returns 4, which accounts for the initial byte-order mark (2 bytes for the code unit plus 2 bytes for the BOM). This is justifiable, though, since the value is meant to cover the worst-case scenario. I believe this implementation has been there since the inception of java.nio, i.e., JDK 1.4.
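For reference, a minimal snippet that prints the values in question (the class name is just illustrative), showing the 4 vs. 2 difference described above:

import java.nio.charset.Charset;

public class MaxBytesPerCharCheck {
    public static void main(String[] args) {
        // Print maxBytesPerChar() for the three UTF-16 charsets.
        for (String name : new String[] {"UTF-16", "UTF-16BE", "UTF-16LE"}) {
            System.out.println(name + ": "
                    + Charset.forName(name).newEncoder().maxBytesPerChar());
        }
    }
}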

Obviously I can clarify the spec of maxBytesPerChar() to account for conversion-independent prefix (or suffix) bytes, such as the BOM, but I am not sure of the original intent of the method. If it is intended to return purely the maximum number of bytes for a single input char, UTF-16 should also have been returning 2. But in that case, the caller would not be able to calculate the worst-case byte buffer size, as described in the bug report.
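To illustrate the calculation the bug report relies on, here is a rough sketch (the buffer sizing simply follows the method description; the single-character input is only an example):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class WorstCaseBuffer {
    public static void main(String[] args) {
        CharsetEncoder enc = StandardCharsets.UTF_16.newEncoder();
        CharBuffer in = CharBuffer.wrap("A");

        // Worst-case sizing as the method description suggests:
        // input length * maxBytesPerChar(), rounded up.
        int capacity = (int) Math.ceil(in.remaining() * enc.maxBytesPerChar());
        ByteBuffer out = ByteBuffer.allocate(capacity);

        enc.encode(in, out, true);
        enc.flush(out);

        // With maxBytesPerChar() == 4 the output (2-byte BOM + 2-byte code
        // unit) fits; a value of 2 would leave the buffer too small here.
        System.out.println("capacity=" + capacity + ", written=" + out.position());
    }
}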

Naoto
