Hi Mark,
Thank you for the crystal clear explanation. I will go ahead and clarify
the method description.
Naoto
On 9/20/19 3:03 PM, mark.reinh...@oracle.com wrote:
2019/9/20 13:25:38 -0700, naoto.s...@oracle.com:
I am looking at the following bug:
https://bugs.openjdk.java.net/browse/JDK-8230531
and hoping someone who is familiar with the encoder can clear things
up. As quoted in the bug report, the method description reads:
--
Returns the maximum number of bytes that will be produced for each
character of input. This value may be used to compute the worst-case
size of the output buffer required for a given input sequence.
--
Initially I thought it would return the maximum number of encoded bytes
for an arbitrary input "char" value, i.e., a code unit of the UTF-16
encoding. For example, any UTF-16 charset (UTF-16, UTF-16BE, and
UTF-16LE) would return 2 from the method, since a code unit is a 16-bit
value. In reality, the encoder of the UTF-16 charset returns 4, which
accounts for the initial byte-order mark (2 bytes for a code unit plus
2 bytes for the BOM).
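For example, a quick check (the class name below is just for
illustration) reports the per-encoder values directly; UTF-16 prints
4.0 while the BOM-less variants print 2.0:

    import java.nio.charset.Charset;

    public class MaxBytesPerCharDemo {
        public static void main(String[] args) {
            // UTF-16 (which writes a BOM) reports 4.0; UTF-16BE/LE report 2.0
            for (String name : new String[] { "UTF-16", "UTF-16BE", "UTF-16LE" }) {
                Charset cs = Charset.forName(name);
                System.out.println(name + ": " + cs.newEncoder().maxBytesPerChar());
            }
        }
    }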
Exactly. A comment in the implementation, in sun.nio.cs.UnicodeEncoder,
mentions this (perhaps you already saw it):
    protected UnicodeEncoder(Charset cs, int bo, boolean m) {
        super(cs, 2.0f,
*             // Four bytes max if you need a BOM
*             m ? 4.0f : 2.0f,
              // Replacement depends upon byte order
              ((bo == BIG)
               ? new byte[] { (byte)0xff, (byte)0xfd }
               : new byte[] { (byte)0xfd, (byte)0xff }));
        usesMark = needsMark = m;
        byteOrder = bo;
    }
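For instance, encoding a single character with the standard UTF-16
charset shows where the four bytes come from (a small sketch; the class
name is illustrative):

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class BomOverheadDemo {
        public static void main(String[] args) {
            // One input char produces four output bytes: 2 for the BOM, 2 for the code unit
            ByteBuffer out = StandardCharsets.UTF_16.encode("A");
            System.out.println(out.remaining());           // 4
            while (out.hasRemaining()) {
                System.out.printf("%02X ", out.get());     // FE FF 00 41 (big-endian with BOM)
            }
        }
    }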
This is justifiable, though, because it is meant to be the worst-case
scenario. I believe this implementation has been there since the
inception of java.nio, i.e., JDK 1.4.
Yes, it has.
Obviously I can clarify the spec of maxBytesPerChar() to account for
conversion-independent prefix (or suffix) bytes, such as the BOM, but I
am not sure of the original intent of the method. If it is meant to
return the pure maximum number of bytes for a single input char, UTF-16
should also return 2. But in that case, the caller would not be able to
calculate the worst-case byte-buffer size as described in the bug report.
The original intent is that the return value of this method can be used
to allocate a buffer that is guaranteed to be large enough for any
possible output. Returning 2 for UTF-16 would, as you observe, not work
for that purpose.
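A sketch of that intended usage (the class name and input string are
illustrative; any charset works the same way):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.StandardCharsets;

    public class WorstCaseBufferDemo {
        public static void main(String[] args) {
            String input = "\uD83D\uDE00 hello";           // arbitrary input sequence
            CharsetEncoder enc = StandardCharsets.UTF_16.newEncoder();

            // Worst-case capacity: number of input chars times the per-char maximum
            int capacity = (int) Math.ceil(input.length() * enc.maxBytesPerChar());
            ByteBuffer out = ByteBuffer.allocate(capacity);

            // A buffer sized this way can never overflow during encoding
            enc.encode(CharBuffer.wrap(input), out, true);
            enc.flush(out);

            out.flip();
            System.out.println(out.remaining() + " bytes used, capacity " + capacity);
        }
    }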
To avoid this confusion, a more verbose specification might read:
* Returns the maximum number of $otype$s that will be produced for each
* $itype$ of input. This value may be used to compute the worst-case size
* of the output buffer required for a given input sequence. This value
* accounts for any necessary content-independent prefix or suffix
#if[encoder]
* $otype$s, such as byte-order marks.
#end[encoder]
#if[decoder]
* $otype$s.
#end[decoder]
(The example of byte-order marks applies only to CharsetEncoders, so
I’ve conditionalized that text for Charset-X-Coder.java.template.)
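The decoder-side wording maps to the same allocation pattern via
maxCharsPerByte() (again, a sketch with illustrative names):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.StandardCharsets;

    public class WorstCaseCharBufferDemo {
        public static void main(String[] args) {
            ByteBuffer in = StandardCharsets.UTF_8.encode("\u00FCber");   // "über", 5 UTF-8 bytes
            CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();

            // Worst-case capacity: number of input bytes times the per-byte maximum
            int capacity = (int) Math.ceil(in.remaining() * dec.maxCharsPerByte());
            CharBuffer out = CharBuffer.allocate(capacity);

            dec.decode(in, out, true);
            dec.flush(out);

            out.flip();
            System.out.println(out + " (" + out.remaining() + " chars, capacity " + capacity + ")");
        }
    }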
- Mark