https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464
Jonathan Wakely <redi at gcc dot gnu.org> changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                   |ASSIGNED
   Last reconfirmed|                              |2015-06-09
           Assignee|unassigned at gcc dot gnu.org |redi at gcc dot gnu.org
     Ever confirmed|0                             |1

--- Comment #1 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Leo Carreon from comment #0)
> I just noticed that codecvt_utf16<char32_t>::max_length() is returning 3.
>
> This appears to be the wrong value, because a surrogate pair is composed
> of four bytes, so max_length() should return at least 4.

Agreed, I think that's just a mistake. I wrote this comment in the code:

  int
  codecvt<char16_t, char, mbstate_t>::do_max_length() const throw()
  {
    // Any valid UTF-8 sequence of 3 bytes fits in a single 16-bit code
    // unit, whereas 4 byte sequences require two 16-bit code units.
    return 3;
  }

But that reasoning (even if it's correct!) doesn't apply to
codecvt_utf16<char32_t>.

> I'm also wondering whether the BOM should be taken into account. If a
> UTF-16 string begins with a BOM and its first character happens to be a
> surrogate pair, six bytes have to be consumed to generate a single UCS-4
> character.
>
> Should the same thing be considered for
> codecvt_utf8<char32_t>::max_length(), which currently returns 4? Taking
> into account the BOM and the longest UTF-8 character below 0x10FFFF,
> shouldn't max_length() return 7?
>
> I'm not really sure whether the BOM should be taken into account, because
> the standard's definition of do_max_length() simply says the maximum
> number of input characters that need to be consumed to generate a single
> output character.

That's a very good question.
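Not part of the original exchange, but a minimal check along these lines
makes the first point concrete, assuming a libstdc++ build that still has
the old return value (codecvt_utf16 is deprecated since C++17, which is
fine for a demo):

  #include <codecvt>
  #include <cstdio>
  #include <cwchar>

  int main()
  {
      // Default mode: big-endian input, no BOM handling.
      std::codecvt_utf16<char32_t> cvt;

      // U+10000 encoded in UTF-16BE as the surrogate pair D800 DC00.
      const char bytes[] = { '\xD8', '\x00', '\xDC', '\x00' };
      std::mbstate_t st{};
      char32_t out;
      char32_t* onext;
      const char* fnext;
      auto res = cvt.in(st, bytes, bytes + 4, fnext, &out, &out + 1, onext);
      if (res == std::codecvt_base::ok)
          std::printf("consumed %td bytes for one char32_t, "
                      "but max_length() = %d\n",
                      fnext - bytes, cvt.max_length());
  }

On an affected build this should report four bytes consumed while
max_length() still claims 3, which is exactly the mismatch described above.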
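On the BOM question, the arithmetic for codecvt_utf8<char32_t> would be a
3-byte UTF-8 BOM plus a 4-byte sequence, i.e. seven input bytes for one
output character. A hedged sketch of that case (std::consume_header is the
standard codecvt_mode flag asking the facet to skip a leading BOM):

  #include <codecvt>
  #include <cstdio>
  #include <cwchar>

  int main()
  {
      // consume_header: skip a leading BOM on input.
      std::codecvt_utf8<char32_t, 0x10ffff, std::consume_header> cvt;

      // 3-byte UTF-8 BOM followed by the 4-byte encoding of U+10FFFF.
      const char bytes[] = "\xEF\xBB\xBF" "\xF4\x8F\xBF\xBF";
      std::mbstate_t st{};
      char32_t out;
      char32_t* onext;
      const char* fnext;
      auto res = cvt.in(st, bytes, bytes + 7, fnext, &out, &out + 1, onext);
      if (res == std::codecvt_base::ok)
          std::printf("consumed %td bytes for one char32_t, "
                      "max_length() = %d\n",
                      fnext - bytes, cvt.max_length());
  }

If max_length() is meant to bound the bytes consumed per output character,
this case would need 7 rather than 4; whether the standard intends the BOM
to count toward that bound is the open question above.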