This response confuses me. Are you saying that the UTF-8 encoder is not really producing UTF-8? RFC 2279 and RFC 3629 both clearly state that a surrogate pair must be combined into a single code point, which is then encoded as a 4-byte sequence. In fact, RFC 3629 refers to the alternate encoding CESU-8, which encodes each half of the surrogate pair as a 3-byte UTF-8 sequence.
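To make the distinction concrete, here is a minimal Java sketch, not taken from the JDK sources, that encodes one supplementary code point both ways. The class and method names (`SurrogateEncodingSketch`, `utf8`, `cesu8`, `threeByteForm`) are hypothetical, and the CESU-8 path is hand-rolled for illustration only.

```java
import java.nio.charset.StandardCharsets;

class SurrogateEncodingSketch {

    // Encode a single UTF-16 code unit as a 3-byte UTF-8-style sequence.
    // Only meaningful for values in U+0800..U+FFFF, which includes all surrogates.
    static byte[] threeByteForm(char c) {
        return new byte[] {
            (byte) (0xE0 | (c >> 12)),
            (byte) (0x80 | ((c >> 6) & 0x3F)),
            (byte) (0x80 | (c & 0x3F))
        };
    }

    // Standard UTF-8: the surrogate pair is combined into one code point first.
    static byte[] utf8(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    // CESU-8 style (illustrative): each surrogate is encoded separately.
    static byte[] cesu8(int codePoint) {
        char[] pair = Character.toChars(codePoint);
        if (pair.length != 2) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        byte[] out = new byte[6];
        System.arraycopy(threeByteForm(pair[0]), 0, out, 0, 3);
        System.arraycopy(threeByteForm(pair[1]), 0, out, 3, 3);
        return out;
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        int cp = 0x1F600; // a supplementary code point, i.e. two Java chars
        System.out.println("UTF-8:  " + hex(utf8(cp)));  // 4 bytes
        System.out.println("CESU-8: " + hex(cesu8(cp))); // 6 bytes
    }
}
```

For U+1F600 the UTF-8 form is 4 bytes for 2 chars, while the CESU-8 style form is 3 bytes for each of the 2 chars.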
I guess returning 3.0 for maxBytesPerChar works for the purpose of allocating a big enough byte buffer, but the explanation in this thread is confusing.

Tom Salter

------------------------------

Date: Tue, 23 Sep 2014 11:37:07 +0400
From: Ivan Gerasimov <ivan.gerasi...@oracle.com>
To: Xueming Shen <xueming.s...@oracle.com>, Martin Buchholz <marti...@google.com>
Cc: nio-...@openjdk.java.net, core-libs-dev <core-libs-dev@openjdk.java.net>
Subject: Re: RFR [8058875]: CharsetEncoder.maxBytesPerChar() should return 4 for UTF-8
Message-ID: <54212323.5080...@oracle.com>
Content-Type: text/plain; charset=UTF-8; format=flowed

Martin, Sherman, thanks for the clarification!
Closing the bug as not a bug.

> The "character" in the nio Charset and CharDe/Encoder is specified as
> "sixteen-bit Unicode code unit", so it is reasonable to interpret the
> "character" in the "maximum number of bytes that will be produced for
> each character of input" to be the Java "char" as well. In case of
> UTF8, each 4-byte form supplementary character is always coded into 2
> surrogate chars, it's "2 byte per char".

> Do we have a real escalation that complains about this?
>
Yes, the link is on the bug page:
https://bugs.openjdk.java.net/browse/JDK-8058875

I'm going to try to explain what I've just realized about this function :-)

Sincerely yours,
Ivan
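
The "bytes per Java char" reading discussed above can be checked with a small sketch. This is only an illustration under the assumption that "character" means a 16-bit `char`; the class and helper names (`MaxBytesPerCharSketch`, `utf8MaxBytesPerChar`, `bytesPerChar`) are hypothetical, while `CharsetEncoder.maxBytesPerChar()` is the method under discussion.

```java
import java.nio.charset.StandardCharsets;

class MaxBytesPerCharSketch {

    // The value reported by the JDK's UTF-8 encoder (3.0 according to this thread).
    static float utf8MaxBytesPerChar() {
        return StandardCharsets.UTF_8.newEncoder().maxBytesPerChar();
    }

    // Encoded bytes divided by the number of UTF-16 code units (Java chars).
    static double bytesPerChar(String s) {
        return (double) s.getBytes(StandardCharsets.UTF_8).length / s.length();
    }

    public static void main(String[] args) {
        String ascii = "a";                  // 1 char -> 1 byte
        String euro = "\u20AC";              // 1 char -> 3 bytes
        String supplementary = "\uD83D\uDE00"; // 2 chars -> 4 bytes

        System.out.println("maxBytesPerChar = " + utf8MaxBytesPerChar());
        System.out.println("ascii:         " + bytesPerChar(ascii));
        System.out.println("euro:          " + bytesPerChar(euro));
        System.out.println("supplementary: " + bytesPerChar(supplementary));
    }
}
```

The worst case per char is a 3-byte BMP character; a supplementary character costs 4 bytes but spans 2 chars, so a buffer sized at `maxBytesPerChar() * input.length()` bytes is still large enough.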