2019/9/20 13:25:38 -0700, naoto.s...@oracle.com:
> I am looking at the following bug:
>
> https://bugs.openjdk.java.net/browse/JDK-8230531
>
> and hoping someone who is familiar with the encoder will clear things
> up. As in the bug report, the method description reads:
>
> --
> Returns the maximum number of bytes that will be produced for each
> character of input. This value may be used to compute the worst-case
> size of the output buffer required for a given input sequence.
> --
>
> Initially I thought it would return the maximum number of encoded bytes
> for an arbitrary input "char" value, i.e., a code unit of the UTF-16
> encoding. For example, any UTF-16 Charset (UTF-16, UTF-16BE, and
> UTF-16LE) would return 2 from the method, as the code unit is a 16-bit
> value. In reality, the encoder of the UTF-16 Charset returns 4, which
> accounts for the initial byte-order mark (2 bytes for a code unit, plus
> the size of the BOM).
Exactly. A comment in the implementation, in sun.nio.cs.UnicodeEncoder,
mentions this (perhaps you already saw it):

    protected UnicodeEncoder(Charset cs, int bo, boolean m) {
        super(cs, 2.0f,
              // Four bytes max if you need a BOM
              m ? 4.0f : 2.0f,
              // Replacement depends upon byte order
              ((bo == BIG)
               ? new byte[] { (byte)0xff, (byte)0xfd }
               : new byte[] { (byte)0xfd, (byte)0xff }));
        usesMark = needsMark = m;
        byteOrder = bo;
    }

> This is justifiable because it is meant to be the worst-case scenario,
> though. I believe this implementation has been there since the
> inception of java.nio, i.e., JDK 1.4.

Yes, it has.

> Obviously I can clarify the spec of maxBytesPerChar() to account for
> the conversion-independent prefix (or suffix) bytes, such as the BOM,
> but I am not sure of the initial intent of the method. If it intends
> to return the pure max bytes for a single input char, UTF-16 should
> also have been returning 2. But in that case, the caller would not be
> able to calculate the worst-case byte buffer size as in the bug report.

The original intent is that the return value of this method can be used
to allocate a buffer that is guaranteed to be large enough for any
possible output. Returning 2 for UTF-16 would, as you observe, not work
for that purpose.

To avoid this confusion, a more verbose specification might read:

     * Returns the maximum number of $otype$s that will be produced for each
     * $itype$ of input.  This value may be used to compute the worst-case size
     * of the output buffer required for a given input sequence.  This value
     * accounts for any necessary content-independent prefix or suffix
    #if[encoder]
     * $otype$s, such as byte-order marks.
    #end[encoder]
    #if[decoder]
     * $otype$s.
    #end[decoder]

(The example of byte-order marks applies only to CharsetEncoders, so
I've conditionalized that text for Charset-X-Coder.java.template.)

- Mark
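
[Editorial note: the behavior discussed above can be observed with a
minimal standalone sketch against the public java.nio.charset API. The
class name below is illustrative, not from the thread; the commented
values are those reported in this exchange.]

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.StandardCharsets;

    public class MaxBytesPerCharDemo {    // hypothetical demo class
        public static void main(String[] args) {
            // UTF-16 reserves room for the byte-order mark:
            // 2 bytes per code unit + 2 bytes of BOM
            System.out.println(
                StandardCharsets.UTF_16.newEncoder().maxBytesPerChar());    // 4.0
            // UTF-16BE writes no BOM, so 2 bytes per char suffice
            System.out.println(
                StandardCharsets.UTF_16BE.newEncoder().maxBytesPerChar());  // 2.0

            // Worst-case buffer sizing, as the spec intends
            String input = "hello";
            CharsetEncoder enc = StandardCharsets.UTF_16.newEncoder();
            int capacity = (int) Math.ceil(input.length() * enc.maxBytesPerChar());
            ByteBuffer out = ByteBuffer.allocate(capacity);  // 20 bytes: always enough
            enc.encode(CharBuffer.wrap(input), out, true);
            enc.flush(out);
            // BOM (2 bytes) + 5 chars * 2 bytes = 12 bytes actually written
            System.out.println(out.position() + " of " + capacity + " bytes used");
        }
    }

Charging the BOM to maxBytesPerChar() is what keeps the idiom
length * maxBytesPerChar() a safe upper bound for every input, which is
the guarantee the clarified specification spells out.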