On 1/25/23 13:06, Ben Boeckel wrote:
Unicode does not support such values because they are unrepresentable in
UTF-16.

libcpp/

        * charset.cc: Reject encodings of codepoints above 0x10FFFF.
        UTF-16 cannot represent such codepoints, and therefore Unicode
        as a whole excludes them.
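
As an aside, that 0x10FFFF ceiling is exactly what UTF-16 surrogate
pairs can reach.  A quick standalone check of the arithmetic, for
illustration only and not part of the patch:

#include <cstdio>

int main ()
{
  /* A surrogate pair encodes 0x10000 + (high - 0xD800) * 0x400
     + (low - 0xDC00); using the largest high (0xDBFF) and low (0xDFFF)
     surrogates gives the largest reachable codepoint.  */
  unsigned high = 0xDBFF, low = 0xDFFF;
  unsigned cp = 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00);
  printf ("0x%X\n", cp);  /* prints 0x10FFFF */
}

Anything above that value has no UTF-16 encoding, hence the limit.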

It seems that this causes a bunch of testsuite failures, from tests that expect this limit to be checked elsewhere with a different diagnostic. I think the easiest thing is to fold the check into _cpp_valid_utf8_str instead, i.e.:

Make sense?

Jason
From 296e9d1e16533979d12bd98db2937e396a0796f3 Mon Sep 17 00:00:00 2001
From: Ben Boeckel <ben.boec...@kitware.com>
Date: Sat, 10 Dec 2022 17:20:49 -0500
Subject: [PATCH] libcpp: add a function to determine UTF-8 validity of a C
 string
To: gcc-patc...@gcc.gnu.org

This provides a simpler interface for UTF-8 validity checks in cases
where a plain "yes" or "no" answer is sufficient.

libcpp/

	* charset.cc (_cpp_valid_utf8_str): New function to determine
	whether a C string is valid UTF-8.
	* internal.h (_cpp_valid_utf8_str): Declare.

Signed-off-by: Ben Boeckel <ben.boec...@kitware.com>
---
 libcpp/internal.h |  2 ++
 libcpp/charset.cc | 24 ++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/libcpp/internal.h b/libcpp/internal.h
index 9724676a8cd..48520901b2d 100644
--- a/libcpp/internal.h
+++ b/libcpp/internal.h
@@ -834,6 +834,8 @@ extern bool _cpp_valid_utf8 (cpp_reader *pfile,
 			     struct normalize_state *nst,
 			     cppchar_t *cp);
 
+extern bool _cpp_valid_utf8_str (const char *str);
+
 extern void _cpp_destroy_iconv (cpp_reader *);
 extern unsigned char *_cpp_convert_input (cpp_reader *, const char *,
 					  unsigned char *, size_t, size_t,
diff --git a/libcpp/charset.cc b/libcpp/charset.cc
index 3c47d4f868b..42a1b596c06 100644
--- a/libcpp/charset.cc
+++ b/libcpp/charset.cc
@@ -1864,6 +1864,30 @@ _cpp_valid_utf8 (cpp_reader *pfile,
   return true;
 }
 
+/* Detect whether a C string consists entirely of valid UTF-8.  Returns
+   false if any byte sequence is malformed UTF-8 or encodes an invalid
+   Unicode codepoint.  Returns true otherwise.  */
+
+extern bool
+_cpp_valid_utf8_str (const char *name)
+{
+  const uchar *in = (const uchar *) name;
+  size_t len = strlen (name);
+  cppchar_t cp;
+
+  while (*in)
+    {
+      if (one_utf8_to_cppchar (&in, &len, &cp))
+	return false;
+
+      /* one_utf8_to_cppchar doesn't check this limit.  */
+      if (cp > UCS_LIMIT)
+	return false;
+    }
+
+  return true;
+}
+
 /* Subroutine of convert_hex and convert_oct.  N is the representation
    in the execution character set of a numeric escape; write it into the
    string buffer TBUF and update the end-of-string pointer therein.  WIDE
-- 
2.31.1
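
For anyone who wants to poke at the behavior outside of libcpp, here is
a self-contained sketch of the same decode-and-check loop.  Since
one_utf8_to_cppchar and UCS_LIMIT are internal to libcpp, the decoder
below is a minimal stand-in rather than the patch's code; all names in
it are illustrative.

#include <cstdint>
#include <cstdio>

static const uint32_t UCS_LIMIT = 0x10FFFF;  /* last valid codepoint */

/* Decode one UTF-8 sequence starting at *p and advance *p past it.
   Returns false on a malformed sequence (bad lead byte, missing
   continuation, overlong form, or UTF-16 surrogate).  */
static bool
decode_one (const unsigned char **p, uint32_t *cp)
{
  const unsigned char *s = *p;
  unsigned char c = s[0];
  uint32_t v;
  int len;

  if (c < 0x80)
    { v = c; len = 1; }
  else if (c < 0xC2)
    return false;   /* stray continuation byte or overlong 2-byte lead */
  else if (c < 0xE0)
    { v = c & 0x1F; len = 2; }
  else if (c < 0xF0)
    { v = c & 0x0F; len = 3; }
  else if (c < 0xF5)
    { v = c & 0x07; len = 4; }
  else
    return false;   /* lead byte can only encode values above 0x10FFFF */

  for (int i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
	return false;   /* also catches a NUL terminator mid-sequence */
      v = (v << 6) | (s[i] & 0x3F);
    }

  static const uint32_t min_for_len[] = { 0, 0, 0x80, 0x800, 0x10000 };
  if (v < min_for_len[len] || (v >= 0xD800 && v <= 0xDFFF))
    return false;   /* overlong encoding or surrogate codepoint */

  *p += len;
  *cp = v;
  return true;
}

/* Same shape as the patch: walk the string one sequence at a time and
   additionally reject anything above UCS_LIMIT, which the decoder above
   does not catch for four-byte sequences starting with 0xF4.  */
static bool
valid_utf8_str (const char *name)
{
  const unsigned char *in = (const unsigned char *) name;
  uint32_t cp;

  while (*in)
    {
      if (!decode_one (&in, &cp))
	return false;
      if (cp > UCS_LIMIT)
	return false;
    }
  return true;
}

int
main ()
{
  printf ("%d\n", valid_utf8_str ("caf\xC3\xA9"));       /* 1: valid */
  printf ("%d\n", valid_utf8_str ("\xC0\xAF"));          /* 0: overlong */
  printf ("%d\n", valid_utf8_str ("\xF4\x90\x80\x80"));  /* 0: U+110000 */
  return 0;
}

The last test is the interesting one: a structurally well-formed
four-byte sequence that decodes to U+110000, which only the explicit
limit check rejects.  That mirrors the comment in the patch about
one_utf8_to_cppchar not checking this limit.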
