https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

Jonathan Wakely <redi at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2015-06-09
           Assignee|unassigned at gcc dot gnu.org      |redi at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Leo Carreon from comment #0)
> I just noticed that codecvt_utf16<char32_t>::max_length() is returning 3.
> 
> This appears to be the wrong value, because a surrogate pair is composed
> of 4 bytes, so max_length() should be returning at least 4.

Agreed, I think that's just a mistake.

I wrote this comment in the code:

int
codecvt<char16_t, char, mbstate_t>::do_max_length() const throw()
{
  // Any valid UTF-8 sequence of 3 bytes fits in a single 16-bit code unit,
  // whereas 4 byte sequences require two 16-bit code units.
  return 3;
}

But that reasoning (even if it's correct!) doesn't apply to
codecvt_utf16<char32_t>.
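
For illustration, here is a minimal sketch (mine, not part of the report)
of why 4 is the lower bound for codecvt_utf16<char32_t>: a code point
outside the BMP, e.g. U+1F600, is encoded as a 4-byte surrogate pair, and
all 4 bytes must be consumed to produce one char32_t.  The value printed
for max_length() depends on the libstdc++ version in use.

#include <codecvt>
#include <cwchar>
#include <iostream>
#include <locale>

int main()
{
  std::codecvt_utf16<char32_t> cvt;
  std::cout << "max_length() = " << cvt.max_length() << '\n'; // 3 with the buggy facet

  // UTF-16BE bytes for U+1F600: high surrogate D83D, low surrogate DE00.
  const char bytes[] = { '\xD8', '\x3D', '\xDE', '\x00' };
  std::mbstate_t st{};
  char32_t out[2];
  const char* from_next;
  char32_t* to_next;
  cvt.in(st, bytes, bytes + 4, from_next, out, out + 2, to_next);

  // All 4 input bytes are consumed to produce a single char32_t,
  // so do_max_length() must return at least 4.
  std::cout << "bytes consumed: " << (from_next - bytes) << '\n';
}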

> I'm also wondering whether the BOM should be taken into account.  If a
> UTF-16 string begins with a BOM and its first character is encoded as a
> surrogate pair, 6 bytes have to be consumed to generate a single UCS-4
> character.
> 
> Should the same thing be considered for
> codecvt_utf8<char32_t>::max_length(), which currently returns 4?  Taking
> into account the BOM and the longest UTF-8 encoding of a character below
> 0x10FFFF, shouldn't max_length() return 7?
> 
> I'm not really sure if the BOM should be taken into account, because the
> standard's definition of do_max_length() simply says the maximum number
> of input characters that need to be consumed to generate a single output
> character.

That's a very good question.
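
For what it's worth, a small sketch (my own, not from the report) of the
scenario described above: with std::consume_header the facet has to read
the 2-byte BOM plus a 4-byte surrogate pair, i.e. 6 bytes in total, before
it can emit the first char32_t.  Whether do_max_length() is meant to
account for that is exactly the open question.

#include <codecvt>
#include <cwchar>
#include <iostream>
#include <locale>

int main()
{
  std::codecvt_utf16<char32_t, 0x10ffff, std::consume_header> cvt;

  // BOM (FE FF) followed by the UTF-16BE surrogate pair for U+1F600.
  const char bytes[] = { '\xFE', '\xFF', '\xD8', '\x3D', '\xDE', '\x00' };
  std::mbstate_t st{};
  char32_t out[2];
  const char* from_next;
  char32_t* to_next;
  cvt.in(st, bytes, bytes + 6, from_next, out, out + 2, to_next);

  std::cout << "bytes consumed:      " << (from_next - bytes) << '\n'  // 6
            << "code points written: " << (to_next - out) << '\n';     // 1
}

The same arithmetic gives 7 for codecvt_utf8<char32_t>: a 3-byte BOM
(EF BB BF) plus a 4-byte sequence for a character above U+FFFF.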
