Much of the documentation (especially the early material, written when supplementary characters were rare or nonexistent) doesn't distinguish clearly enough between "character (code point)" and "char". Fixing that throughout the docs would be a fine thing to do.
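As a quick illustration of the distinction (a minimal sketch; the class name is made up for the example, but the `String`/`Character` methods are standard Java API):

```java
public class CharVsCodePoint {
    public static void main(String[] args) {
        // U+1F600 is a supplementary character: ONE code point,
        // but TWO chars (a UTF-16 surrogate pair) in a Java String.
        String s = new String(Character.toChars(0x1F600));
        System.out.println(s.length());                      // 2 -- counts chars
        System.out.println(s.codePointCount(0, s.length())); // 1 -- counts code points
    }
}
```

So "per character" is ambiguous in the Javadoc: it could mean per `char` or per code point, and the two differ exactly for supplementary characters.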
On Mon, Sep 22, 2014 at 2:34 PM, Mark Thomas <ma...@apache.org> wrote:

> On 22/09/2014 22:23, Martin Buchholz wrote:
> > I think you are mistaken. It's maxBytesPerChar, not maxBytesPerCodepoint!
>
> You are going to have to explain that some more. The Javadoc for
> CharsetEncoder.maxBytesPerChar() is explicit:
> <quote>
> Returns the maximum number of bytes that will be produced for each
> character of input.
> </quote>
>
> For UTF-8 that number is 4, not 3. A quick look at the source for the
> default UTF-8 encoder confirms that there are cases where it will output
> 4 bytes for a single input character.
>
> Mark
>
> > changeset: 3116:b44704ce8a08
> > user: sherman
> > date: 2010-11-19 12:58 -0800
> > 6957230: CharsetEncoder.maxBytesPerChar() reports 4 for UTF-8; should be 3
> > Summary: changged utf-8's CharsetEncoder.maxBytesPerChar to 3
> > Reviewed-by: alanb
> >
> > On Mon, Sep 22, 2014 at 1:14 PM, Ivan Gerasimov <ivan.gerasi...@oracle.com> wrote:
> >
> >> Hello!
> >>
> >> The UTF-8 encoding allows characters that are 4 bytes long.
> >> However, CharsetEncoder.maxBytesPerChar() currently returns 3.0, which is
> >> not always enough.
> >>
> >> Would you please review the simple fix for this issue?
> >>
> >> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8058875
> >> WEBREV: http://cr.openjdk.java.net/~igerasim/8058875/0/webrev/
> >>
> >> Sincerely yours,
> >> Ivan
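For what it's worth, the arithmetic behind the value 3 (from JDK-6957230) can be sketched like this. A 4-byte UTF-8 sequence only ever arises from a supplementary character, which occupies two `char`s in Java, so it costs 4/2 = 2 bytes per `char`; the per-`char` worst case is a 3-byte BMP character. A small demonstration (class name is mine; the charset API calls are standard):

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class Utf8MaxBytesDemo {
    public static void main(String[] args) {
        // Worst case per char: a BMP character in the 3-byte range,
        // e.g. U+20AC (EURO SIGN): 1 char -> 3 bytes.
        byte[] euro = "\u20ac".getBytes(StandardCharsets.UTF_8);
        System.out.println("euro: chars=1 bytes=" + euro.length); // bytes=3

        // A supplementary character, e.g. U+1F600: 2 chars (surrogate
        // pair) -> 4 bytes, i.e. only 2 bytes per char.
        String emoji = new String(Character.toChars(0x1F600));
        byte[] b = emoji.getBytes(StandardCharsets.UTF_8);
        System.out.println("emoji: chars=" + emoji.length()
                + " bytes=" + b.length); // chars=2 bytes=4

        // Hence maxBytesPerChar() == 3.0 for UTF-8 in current JDKs.
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        System.out.println("maxBytesPerChar=" + enc.maxBytesPerChar());
    }
}
```

The practical upshot: a buffer sized as `length() * maxBytesPerChar()` is still always sufficient, because no input of n `char`s can encode to more than 3n bytes, even when individual code points take 4 bytes.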