Hello,

I am looking at the following bug:

https://bugs.openjdk.java.net/browse/JDK-8230531

and hoping someone who is familiar with the encoder can clear things up. As noted in the bug report, the method's description reads:

--
Returns the maximum number of bytes that will be produced for each character of input. This value may be used to compute the worst-case size of the output buffer required for a given input sequence.
--

Initially I thought it would return the maximum number of encoded bytes for an arbitrary input "char" value, i.e., a code unit of the UTF-16 encoding. For example, any of the UTF-16 charsets (UTF-16, UTF-16BE, and UTF-16LE) would return 2 from the method, as the code unit is a 16-bit value. In reality, the encoder for the UTF-16 charset returns 4, which accounts for the initial byte-order mark (2 bytes for the code unit plus 2 bytes for the BOM). This is justifiable, though, since the value is meant to cover the worst-case scenario. I believe this implementation has been there since the inception of java.nio, i.e., JDK 1.4.
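For reference, a minimal snippet that prints the values in question (the class name is just illustrative), showing the 4 vs. 2 difference described above:

import java.nio.charset.Charset;

public class MaxBytesPerCharCheck {
    public static void main(String[] args) {
        // Print maxBytesPerChar() for the three UTF-16 charsets.
        for (String name : new String[] {"UTF-16", "UTF-16BE", "UTF-16LE"}) {
            System.out.println(name + ": "
                    + Charset.forName(name).newEncoder().maxBytesPerChar());
        }
    }
}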

Obviously I can clarify the spec of maxBytesPerChar() to account for conversion-independent prefix (or suffix) bytes, such as the BOM, but I am not sure of the original intent of the method. If it is intended to return purely the maximum number of bytes for a single input char, UTF-16 should also have been returning 2. But in that case, the caller would not be able to calculate the worst-case byte buffer size, as described in the bug report.
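To illustrate the calculation the bug report relies on, here is a rough sketch (the buffer sizing simply follows the method description; the single-character input is only an example):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class WorstCaseBuffer {
    public static void main(String[] args) {
        CharsetEncoder enc = StandardCharsets.UTF_16.newEncoder();
        CharBuffer in = CharBuffer.wrap("A");

        // Worst-case sizing as the method description suggests:
        // input length * maxBytesPerChar(), rounded up.
        int capacity = (int) Math.ceil(in.remaining() * enc.maxBytesPerChar());
        ByteBuffer out = ByteBuffer.allocate(capacity);

        enc.encode(in, out, true);
        enc.flush(out);

        // With maxBytesPerChar() == 4 the output (2-byte BOM + 2-byte code
        // unit) fits; a value of 2 would leave the buffer too small here.
        System.out.println("capacity=" + capacity + ", written=" + out.position());
    }
}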

Naoto
