https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464
Jonathan Wakely <redi at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
   Target Milestone|---                         |5.2

--- Comment #5 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Leo Carreon from comment #2)
> Just clarifying that my comments are to do with codecvt_utf16<char32_t> and
> codecvt_utf8<char32_t>.
>
> The way I understand it, codecvt_utf16<char32_t> should be converting
> between UTF-16 and UCS-4. UTF-16 uses 2 bytes for characters in the BMP
> (characters in the range 0x0000 to 0xFFFF) and 4 bytes (surrogate pairs) for
> characters above the BMP (0x010000 to 0x10FFFF). UCS-4 uses 4 byte values.
> Therefore, codecvt_utf16<char32_t>::max_length() should be returning 4 if
> the BOM is not taken into account.

Yes, that's now fixed.

> codecvt_utf8<char32_t> converts between UTF-8 and UCS-4. UTF-8 can use up
> to 4 bytes for characters up to the range 0x10FFFF. Therefore,
> codecvt_utf8<char32_t>::max_length() should be returning 4 if the BOM is not
> taken into account.
>
> As I said in my previous post, I'm not sure if the BOM should be accounted
> for in max_length().

I've raised that question with the C++ committee.

> If I'm not mistaken, the purpose of this function is
> to allow a user to estimate how many bytes are required to fit a UCS-4
> string when converted to either UTF-16 or UTF-8. And my guess, the BOM can
> be taken into account separately when doing the estimation. For example,
> when wstring_convert estimates the length of the std::string to be generated
> by wstring_convert::to_bytes(). It should be the number of UCS-4 characters
> multiplied by max_length() and then add the size of the BOM if required.
> The resulting std::string can be resized after the conversion to eliminate
> the unused bytes.

I believe that's the usual use case for max_length, and agree it's better to
calculate N * max_length() + length(BOM), rather than have max_length() include
the BOM, however the way max_length() is specified in the standard does suggest
it should be including the BOM. We'll discuss it in the committee and process
it as a defect report against the standard if necessary.

> Note that the comment you mentioned in your reply probably only applies to
> codecvt_utf8_utf16 which converts between UTF-8 and UTF-16 directly without
> going thru the UCS-4 conversion.

Agreed.
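
For illustration only, the N * max_length() + length(BOM) estimate discussed
above could be sketched roughly as follows. The helper name ucs4_to_utf8 is
hypothetical and not part of any library. The sketch assumes that max_length()
does not already include the BOM (the open question in this report) and that
<codecvt>, which is deprecated since C++17, is still available.

```cpp
#include <cassert>
#include <codecvt>    // deprecated since C++17; assumed to be available here
#include <cstddef>
#include <cwchar>     // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of the standard library): convert UCS-4 to
// UTF-8, sizing the output buffer as N * max_length() + length(BOM) and
// trimming the unused bytes afterwards.
template <std::codecvt_mode Mode = static_cast<std::codecvt_mode>(0)>
std::string ucs4_to_utf8(const std::u32string& in)
{
    std::codecvt_utf8<char32_t, 0x10ffff, Mode> cvt;

    // Assumption: max_length() does not include the BOM, so the three-byte
    // UTF-8 BOM is accounted for separately when a header is generated.
    const std::size_t bom_len = (Mode & std::generate_header) ? 3 : 0;
    const std::size_t bound =
        in.size() * static_cast<std::size_t>(cvt.max_length()) + bom_len;

    std::string out(bound, '\0');
    std::mbstate_t state{};
    const char32_t* from_next = nullptr;
    char* to_next = nullptr;
    char* const to_begin = &out[0];

    const auto res = cvt.out(state, in.data(), in.data() + in.size(), from_next,
                             to_begin, to_begin + out.size(), to_next);
    if (res != std::codecvt_base::ok)
        throw std::runtime_error("UCS-4 to UTF-8 conversion failed");

    // The conversion must not have written past the estimated bound.
    assert(to_next <= to_begin + out.size());

    // Drop the bytes that the worst-case estimate reserved but did not need.
    out.resize(static_cast<std::size_t>(to_next - to_begin));
    return out;
}
```

Calling ucs4_to_utf8(U"a\u20AC\U0001F600") reserves 3 * max_length() bytes and
shrinks the result to the bytes actually written. Instantiating the helper
with std::generate_header adds room for the BOM on top of that estimate.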