Re: [PATCH v3] libstdc++: Fix handling of surrogate CP in codecvt [PR108976]

2023-09-29 Thread Jonathan Wakely
On Thu, 28 Sept 2023 at 20:39, Dimitrij Mijoski via Libstdc++ wrote:
>
> This patch fixes the handling of surrogate code points in all standard
> facets for transcoding Unicode that are based on std::codecvt. Surrogate
> code points should always be treated as an error. Surrogate code units,
> on the other hand, can only appear in UTF-16, and only when they come
> in a proper pair.
>
> Additionally, it fixes a bug in std::codecvt_utf16::in(): when an odd
> number of bytes was given in the range [from, from_end), error was
> always returned. The last byte in such a range does not form a full
> UTF-16 code unit, so no decision about an error can be made; partial
> should be returned instead.
>
> The testsuite for testing these facets was updated in the following
> order:
>
> 1. All functions that test codecvts that work with UTF-8 were refactored
>    and made more generic so they accept a codecvt that works with the
>    char type char8_t.
> 2. The same functions were updated with new test cases for transcoding
>    errors and now additionally test for surrogates, overlong UTF-8
>    sequences, code points out of the Unicode range, and more tests for
>    missing leading and trailing code units.
> 3. New tests were added to test codecvt_utf16 in both of its variants,
>    UTF-16 <-> UTF-32/UCS-4 and UTF-16 <-> UCS-2.
>
> libstdc++-v3/ChangeLog:
>
> * src/c++11/codecvt.cc (read_utf8_code_point): Fix handling of
> surrogates in UTF-8.
> (ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-8.
> (ucs4_in): Fix handling of range with odd number of bytes.
> (ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-16.
> (ucs2_out): Fix handling of surrogates in UCS-2 -> UTF-16.
> (ucs2_in): Fix handling of range with odd number of bytes.
> (__codecvt_utf16_base<char16_t>::do_in): Likewise.
> (__codecvt_utf16_base<char32_t>::do_in): Likewise.
> (__codecvt_utf16_base<wchar_t>::do_in): Likewise.
> * testsuite/22_locale/codecvt/codecvt_unicode.cc: Renames, add
> tests for codecvt_utf16<char16_t> and codecvt_utf16<char32_t>.
> * testsuite/22_locale/codecvt/codecvt_unicode.h: Refactor UTF-8
> testing functions for char8_t, add more test cases for errors,
> add testing functions for codecvt_utf16.
> * testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc:
> Renames, add tests for codecvt_utf16<wchar_t>.
> * testsuite/22_locale/codecvt/codecvt_utf16/79980.cc (test06):
> Fix test.
> * testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc: New test.

Thanks, your v2 patch was still on my TODO list. I've pushed this
version to trunk now.



[PATCH v3] libstdc++: Fix handling of surrogate CP in codecvt [PR108976]

2023-09-28 Thread Dimitrij Mijoski
This patch fixes the handling of surrogate code points in all standard
facets for transcoding Unicode that are based on std::codecvt. Surrogate
code points should always be treated as an error. Surrogate code units,
on the other hand, can only appear in UTF-16, and only when they come
in a proper pair.
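
To illustrate why the UTF-8 side can reject surrogates by looking at only
the first two bytes, here is a minimal sketch. The helper names
(is_surrogate, encode3, looks_like_utf8_surrogate) are hypothetical and
are not part of libstdc++; only the condition c1 == 0xED && c2 >= 0xA0 is
taken from the patch. Feeding a surrogate code point through the generic
three-byte UTF-8 pattern always yields a lead byte of 0xED followed by a
byte in 0xA0..0xBF.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper, not part of libstdc++: true for U+D800..U+DFFF.
constexpr bool is_surrogate(char32_t c)
{ return 0xD800 <= c && c <= 0xDFFF; }

// Hypothetical helper: encode a code point in U+0800..U+FFFF using the
// generic three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx. It deliberately
// does NOT reject surrogates, so we can see what bytes they would produce.
constexpr std::array<unsigned char, 3> encode3(char32_t c)
{
  return {{ static_cast<unsigned char>(0xE0 | (c >> 12)),
            static_cast<unsigned char>(0x80 | ((c >> 6) & 0x3F)),
            static_cast<unsigned char>(0x80 | (c & 0x3F)) }};
}

// The condition the patch adds to the three-byte branch, expressed on
// the lead byte c1 and the first continuation byte c2.
constexpr bool looks_like_utf8_surrogate(unsigned char c1, unsigned char c2)
{ return c1 == 0xED && c2 >= 0xA0; }
```

With these helpers, encode3(0xD800) gives the bytes ED A0 80 and
encode3(0xD7FF) gives ED 9F BF, so the second byte alone separates the
last valid code point before the surrogate block from the first surrogate.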

Additionally, it fixes a bug in std::codecvt_utf16::in(): when an odd
number of bytes was given in the range [from, from_end), error was
always returned. The last byte in such a range does not form a full
UTF-16 code unit, so no decision about an error can be made; partial
should be returned instead.
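
The intended result for an odd-length range can be modelled with a small
standalone sketch. This is not the libstdc++ implementation: conv_result
and decode_utf16be_to_ucs2 are hypothetical names, the input is assumed to
be big-endian UTF-16 without a BOM, and the target is assumed to be UCS-2
(so any surrogate code unit is an error). The point is only the final
return statement: leftover bytes mean partial, not error.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the values of std::codecvt_base::result.
enum class conv_result { ok, partial, error };

// Hypothetical model: decode big-endian UTF-16 bytes into UCS-2 code
// units. Complete code units are consumed two bytes at a time.
conv_result
decode_utf16be_to_ucs2(const unsigned char* from,
                       const unsigned char* from_end,
                       std::vector<char16_t>& out)
{
  while (from_end - from >= 2)
    {
      const char16_t c = static_cast<char16_t>((from[0] << 8) | from[1]);
      if (0xD800 <= c && c <= 0xDFFF)
        return conv_result::error;   // UCS-2 cannot represent surrogates
      out.push_back(c);
      from += 2;
    }
  // A single trailing byte is half of a code unit: nothing can be said
  // about it yet, so report partial rather than error.
  return from == from_end ? conv_result::ok : conv_result::partial;
}
```

For example, the three bytes 00 41 D8 decode one code unit (U+0041) and
then report partial, even though D8 could be the start of a surrogate;
only a complete code unit such as D8 00 is reported as an error.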

The testsuite for testing these facets was updated in the following
order:

1. All functions that test codecvts that work with UTF-8 were refactored
   and made more generic so they accept a codecvt that works with the
   char type char8_t.
2. The same functions were updated with new test cases for transcoding
   errors and now additionally test for surrogates, overlong UTF-8
   sequences, code points out of the Unicode range, and more tests for
   missing leading and trailing code units.
3. New tests were added to test codecvt_utf16 in both of its variants,
   UTF-16 <-> UTF-32/UCS-4 and UTF-16 <-> UCS-2.

libstdc++-v3/ChangeLog:

* src/c++11/codecvt.cc (read_utf8_code_point): Fix handling of
surrogates in UTF-8.
(ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-8.
(ucs4_in): Fix handling of range with odd number of bytes.
(ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-16.
(ucs2_out): Fix handling of surrogates in UCS-2 -> UTF-16.
(ucs2_in): Fix handling of range with odd number of bytes.
(__codecvt_utf16_base<char16_t>::do_in): Likewise.
(__codecvt_utf16_base<char32_t>::do_in): Likewise.
(__codecvt_utf16_base<wchar_t>::do_in): Likewise.
* testsuite/22_locale/codecvt/codecvt_unicode.cc: Renames, add
tests for codecvt_utf16<char16_t> and codecvt_utf16<char32_t>.
* testsuite/22_locale/codecvt/codecvt_unicode.h: Refactor UTF-8
testing functions for char8_t, add more test cases for errors,
add testing functions for codecvt_utf16.
* testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc:
Renames, add tests for codecvt_utf16<wchar_t>.
* testsuite/22_locale/codecvt/codecvt_utf16/79980.cc (test06):
Fix test.
* testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc: New test.
---
 libstdc++-v3/src/c++11/codecvt.cc             |   18 +-
 .../22_locale/codecvt/codecvt_unicode.cc      |   38 +-
 .../22_locale/codecvt/codecvt_unicode.h       | 1799 +
 .../codecvt/codecvt_unicode_char8_t.cc        |   53 +
 .../codecvt/codecvt_unicode_wchar_t.cc        |   32 +-
 .../22_locale/codecvt/codecvt_utf16/79980.cc  |    2 +-
 6 files changed, 1493 insertions(+), 449 deletions(-)
 create mode 100644 libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc

diff --git a/libstdc++-v3/src/c++11/codecvt.cc b/libstdc++-v3/src/c++11/codecvt.cc
index 02f05752d..2cc812cfc 100644
--- a/libstdc++-v3/src/c++11/codecvt.cc
+++ b/libstdc++-v3/src/c++11/codecvt.cc
@@ -284,6 +284,8 @@ namespace
         return invalid_mb_sequence;
       if (c1 == 0xE0 && c2 < 0xA0) [[unlikely]] // overlong
         return invalid_mb_sequence;
+      if (c1 == 0xED && c2 >= 0xA0) [[unlikely]] // surrogate
+        return invalid_mb_sequence;
       if (avail < 3) [[unlikely]]
         return incomplete_mb_character;
       char32_t c3 = (unsigned char) from[2];
@@ -484,6 +486,8 @@ namespace
     while (from.size())
       {
         const char32_t c = from[0];
+        if (0xD800 <= c && c <= 0xDFFF) [[unlikely]]
+          return codecvt_base::error;
         if (c > maxcode) [[unlikely]]
           return codecvt_base::error;
         if (!write_utf8_code_point(to, c)) [[unlikely]]
@@ -508,7 +512,7 @@ namespace
           return codecvt_base::error;
         to = codepoint;
       }
-    return from.size() ? codecvt_base::partial : codecvt_base::ok;
+    return from.nbytes() ? codecvt_base::partial : codecvt_base::ok;
   }
 
   // ucs4 -> utf16
@@ -521,6 +525,8 @@ namespace
     while (from.size())
       {
         const char32_t c = from[0];
+        if (0xD800 <= c && c <= 0xDFFF) [[unlikely]]
+          return codecvt_base::error;
         if (c > maxcode) [[unlikely]]
           return codecvt_base::error;
         if (!write_utf16_code_point(to, c, mode)) [[unlikely]]
@@ -653,7 +659,7 @@ namespace
     while (from.size() && to.size())
       {
         char16_t c = from[0];
-        if (is_high_surrogate(c))
+        if (0xD800 <= c && c <= 0xDFFF)
           return codecvt_base::error;
         if (c > maxcode)
           return codecvt_base::error;
@@ -680,7 +686,7 @@ namespace
           return codecvt_base::error;
         to = c;
       }
-    return from.size() == 0 ? codecvt_base::ok : codecvt_base::partial;
+    return from.nbytes() == 0 ? codecvt_base::ok : codecvt_base::partial;
   }
 
   const char16_t*
@@ -1344,8 +1350,6 @@