https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

Jonathan Wakely <redi at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
   Target Milestone|---                         |5.2

--- Comment #5 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Leo Carreon from comment #2)
> Just clarifying that my comments are to do with codecvt_utf16<char32_t> and
> codecvt_utf8<char32_t>.
> 
> The way I understand it, codecvt_utf16<char32_t> should be converting
> between UTF-16 and UCS-4.  UTF-16 uses 2 bytes for characters in the BMP
> (characters in the range 0x0000 to 0xFFFF) and 4 bytes (surrogate pairs) for
> characters above the BMP (0x010000 to 0x10FFFF).  UCS-4 uses 4 byte values. 
> Therefore, codecvt_utf16<char32_t>::max_length() should be returning 4 if
> the BOM is not taken into account.

Yes, that's now fixed.
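
As a rough sketch of the arithmetic in the quoted text (this is not the
libstdc++ implementation, and utf16_bytes is a hypothetical helper name), the
number of UTF-16 bytes per code point could be written as:

```cpp
#include <cstddef>

// Hypothetical helper: bytes needed to encode one code point in UTF-16.
// It does not validate its input (surrogate values and values above
// 0x10FFFF are not rejected); it only illustrates the size rule.
constexpr std::size_t utf16_bytes(char32_t cp)
{
    // BMP code points take one 16-bit unit, everything above the BMP
    // takes a surrogate pair (two 16-bit units).
    return cp <= 0xFFFF ? 2 : 4;
}

// The worst case over the whole range is 4 bytes, not counting any BOM.
static_assert(utf16_bytes(0x0041) == 2, "BMP code point");
static_assert(utf16_bytes(0x10FFFF) == 4, "code point above the BMP");
```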

> codecvt_utf8<char32_t> converts between UTF-8 and UCS-4.  UTF-8 can use up
> to 4 bytes for characters up to the range 0x10FFFF.  Therefore,
> codecvt_utf8<char32_t>::max_length() should be returning 4 if the BOM is not
> taken into account.
> 
> As I said in my previous post, I'm not sure if the BOM should be accounted
> for in max_length().

I've raised that question with the C++ committee.

>  If I'm not mistaken, the purpose of this function is
> to allow a user to estimate how many bytes are required to fit a UCS-4
> string when converted to either UTF-16 or UTF-8.  My guess is that the BOM
> can be taken into account separately when doing the estimation, for example
> when wstring_convert estimates the length of the std::string to be generated
> by wstring_convert::to_bytes().  The estimate should be the number of UCS-4
> characters multiplied by max_length(), plus the size of the BOM if required.
> The resulting std::string can be resized after the conversion to eliminate
> the unused bytes.

I believe that's the usual use case for max_length(), and I agree it's better
to calculate N * max_length() + length(BOM) rather than have max_length()
include the BOM. However, the way max_length() is specified in the standard
does suggest it should include the BOM. We'll discuss it in the committee and
process it as a defect report against the standard if necessary.
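
A minimal sketch of that N * max_length() + length(BOM) estimate, assuming a
hosted C++11 implementation that still provides <codecvt> (it is deprecated
since C++17). The function name to_utf8 and the with_bom parameter are
hypothetical, the BOM is written by hand rather than through the
generate_header mode, and max_length() may itself be larger than 4 if the
facet counts a BOM, which only makes the estimate more generous:

```cpp
#include <cassert>
#include <codecvt>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert UCS-4 to UTF-8, sizing the output buffer as
// length(BOM) + N * max_length() and trimming the unused bytes afterwards.
std::string to_utf8(const std::u32string& in, bool with_bom = false)
{
    std::codecvt_utf8<char32_t> cvt;  // default mode: the facet writes no BOM

    // The BOM is accounted for separately from max_length().
    const std::string bom = with_bom ? "\xEF\xBB\xBF" : "";

    // Worst-case estimate: BOM plus max_length() bytes per character.
    std::string out = bom;
    out.resize(bom.size()
               + in.size() * static_cast<std::size_t>(cvt.max_length()));

    std::mbstate_t state{};
    const char32_t* from_next = nullptr;
    char* to_next = nullptr;
    const std::codecvt_base::result res =
        cvt.out(state, in.data(), in.data() + in.size(), from_next,
                &out[0] + bom.size(), &out[0] + out.size(), to_next);
    if (res != std::codecvt_base::ok)
        throw std::range_error("UCS-4 to UTF-8 conversion failed");

    // The estimate is an upper bound, so the write position must be in range.
    assert(to_next <= &out[0] + out.size());

    // Drop the bytes that the worst-case estimate reserved but did not need.
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

For example, to_utf8(U"a\u00e9\u20ac\U0001F600") reserves at least 16 bytes
and ends up with 10 (1 + 2 + 3 + 4) after the final resize.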

> Note that the comment you mentioned in your reply probably only applies to
> codecvt_utf8_utf16 which converts between UTF-8 and UTF-16 directly without
> going through the UCS-4 conversion.

Agreed.
