On 22/09/2014 22:46, Xueming Shen wrote:
> On 09/22/2014 01:14 PM, Ivan Gerasimov wrote:
>> Hello!
>>
>> The UTF-8 encoding allows characters that are 4 bytes long.
>> However, CharsetEncoder.maxBytesPerChar() currently returns 3.0, which
>> is not always enough.
>>
>> Would you please review the simple fix for this issue?
>>
>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8058875
>> WEBREV: http://cr.openjdk.java.net/~igerasim/8058875/0/webrev/
>>
>> Sincerely yours,
>> Ivan
>
> The "character" in the nio Charset and CharDe/Encoder is specified as
> "sixteen-bit Unicode code unit", so it is reasonable to interpret the
> "character" in the "maximum number of bytes that will be produced for
> each character of input" to be the Java "char" as well. In case of
> UTF8, each 4-byte form supplementary character is always coded into 2
> surrogate chars, it's "2 byte per char". Do we have a real escalation
> that complains about this?

Ah. Got it. I see now. There are single chars that will result in 3 bytes
of output, but getting 4 bytes of output requires 2 chars of input. In
that case the current value of 3.0 makes sense.

Cheers,
Mark
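
P.S. For anyone who wants to check the arithmetic, here is a minimal Java sketch. It is an illustration only, not part of the webrev; the class name Utf8BytesPerChar and the helper utf8Length are hypothetical names made up for this example. It compares the number of Java chars (UTF-16 code units) with the number of UTF-8 bytes for a BMP character and for a supplementary character.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from the JDK or the webrev.
class Utf8BytesPerChar {
    // Hypothetical helper: number of UTF-8 bytes produced for the string.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Prints chars in, bytes out, and the bytes-per-char ratio.
    static void report(String label, String s) {
        int chars = s.length();          // UTF-16 code units, i.e. Java chars
        int bytes = utf8Length(s);
        System.out.println(label + ": " + chars + " char(s) -> " + bytes
                + " byte(s), " + ((double) bytes / chars) + " bytes per char");
    }

    public static void main(String[] args) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        // The value under discussion in this thread (reported as 3.0).
        System.out.println("maxBytesPerChar = " + enc.maxBytesPerChar());

        // U+20AC EURO SIGN: one char, three UTF-8 bytes.
        report("U+20AC", "\u20AC");

        // U+1F600: a surrogate pair (two chars), four UTF-8 bytes.
        report("U+1F600", new String(Character.toChars(0x1F600)));
    }
}
```

The supplementary character needs two chars of input for its four bytes, so it works out to 2 bytes per char, below the 3 bytes per char of the BMP case.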