Re: [PATCH 1/2] util: Refactor json-writer's string sanitizer to be public

Richard W.M. Jones Fri, 02 Aug 2024 14:00:58 -0700

On Fri, Aug 02, 2024 at 02:26:05PM -0500, Eric Blake wrote:
> My next patch needs to convert text from an untrusted input into an
> output representation that is suitable for display on a terminal is
> useful to more than just the json-writer; the text should normally be
> UTF-8, but blindly allowing all Unicode code points (including ASCII
> ESC) through to a terminal risks remote-code-execution attacks on some
> terminals.  Extract the existing body of json-writer's quoted_strinto
> a new helper routine mod_utf8_sanitize, and generalize it to also work
> on data that is length-limited rather than NUL-terminated.  [I was
> actually surprised that glib does not have such a sanitizer already -
> Google turns up lots of examples of rolling your own string
> sanitizer.]
> 
> If desired in the future, we may want to tweak whether the output is
> guaranteed to be ASCII (using lots of \u escape sequences, including
> surrogate pairs for code points outside the BMP) or if we are okay
> passing printable Unicode through (we still need to escape control
> characters).  But for now, I went for minimal code churn, including
> the fact that the resulting function allows a non-UTF-8 2-byte synonym
> for U+0000.
> 
> Signed-off-by: Eric Blake <ebl...@redhat.com>


I have to admit I'd never heard of "modified UTF-8" before, but it's a
thing:

https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

As the patch is almost a simple code motion:

Reviewed-by: Richard W.M. Jones <rjo...@redhat.com>

Rich.

> ---
>  include/qemu/unicode.h |  3 ++
>  qobject/json-writer.c  | 47 +----------------------
>  util/unicode.c         | 84 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 88 insertions(+), 46 deletions(-)
> 
> diff --git a/include/qemu/unicode.h b/include/qemu/unicode.h
> index 7fa10b8e604..e1013b29f72 100644
> --- a/include/qemu/unicode.h
> +++ b/include/qemu/unicode.h
> @@ -4,4 +4,7 @@
>  int mod_utf8_codepoint(const char *s, size_t n, char **end);
>  ssize_t mod_utf8_encode(char buf[], size_t bufsz, int codepoint);
> 
> +void mod_utf8_sanitize(GString *buf, const char *str);
> +void mod_utf8_sanitize_len(GString *buf, const char *str, ssize_t len);
> +
>  #endif
> diff --git a/qobject/json-writer.c b/qobject/json-writer.c
> index 309a31d57a9..5c14574efee 100644
> --- a/qobject/json-writer.c
> +++ b/qobject/json-writer.c
> @@ -104,53 +104,8 @@ static void pretty_newline_or_space(JSONWriter *writer)
> 
>  static void quoted_str(JSONWriter *writer, const char *str)
>  {
> -    const char *ptr;
> -    char *end;
> -    int cp;
> -
>      g_string_append_c(writer->contents, '"');
> -
> -    for (ptr = str; *ptr; ptr = end) {
> -        cp = mod_utf8_codepoint(ptr, 6, &end);
> -        switch (cp) {
> -        case '\"':
> -            g_string_append(writer->contents, "\\\"");
> -            break;
> -        case '\\':
> -            g_string_append(writer->contents, "\\\\");
> -            break;
> -        case '\b':
> -            g_string_append(writer->contents, "\\b");
> -            break;
> -        case '\f':
> -            g_string_append(writer->contents, "\\f");
> -            break;
> -        case '\n':
> -            g_string_append(writer->contents, "\\n");
> -            break;
> -        case '\r':
> -            g_string_append(writer->contents, "\\r");
> -            break;
> -        case '\t':
> -            g_string_append(writer->contents, "\\t");
> -            break;
> -        default:
> -            if (cp < 0) {
> -                cp = 0xFFFD; /* replacement character */
> -            }
> -            if (cp > 0xFFFF) {
> -                /* beyond BMP; need a surrogate pair */
> -                g_string_append_printf(writer->contents, "\\u%04X\\u%04X",
> -                                       0xD800 + ((cp - 0x10000) >> 10),
> -                                       0xDC00 + ((cp - 0x10000) & 0x3FF));
> -            } else if (cp < 0x20 || cp >= 0x7F) {
> -                g_string_append_printf(writer->contents, "\\u%04X", cp);
> -            } else {
> -                g_string_append_c(writer->contents, cp);
> -            }
> -        }
> -    };
> -
> +    mod_utf8_sanitize(writer->contents, str);
>      g_string_append_c(writer->contents, '"');
>  }
> 
> diff --git a/util/unicode.c b/util/unicode.c
> index 8580bc598b3..a419ed4de35 100644
> --- a/util/unicode.c
> +++ b/util/unicode.c
> @@ -154,3 +154,87 @@ ssize_t mod_utf8_encode(char buf[], size_t bufsz, int 
> codepoint)
>      buf[4] = 0;
>      return 4;
>  }
> +
> +/**
> + * mod_utf8_sanitize:
> + * @buf: Destination buffer
> + * @str: Modified UTF-8 string to sanitize
> + *
> + * Append the contents of the NUL-terminated Modified UTF-8 string
> + * @str into @buf, with escape sequences used for non-printable ASCII
> + * and for quote characters.  The result is therefore safe for output
> + * to a terminal.
> + *
> + * Modified UTF-8 is exactly like UTF-8, except U+0000 is encoded as
> + * "\xC0\x80".
> + */
> +void mod_utf8_sanitize(GString *buf, const char *str)
> +{
> +    mod_utf8_sanitize_len(buf, str, -1);
> +}
> +
> +/**
> + * mod_utf8_sanitize:
> + * @buf: Destination buffer
> + * @str: Modified UTF-8 string to sanitize
> + * @len: Length of @str, or negative to stop at NUL terminator
> + *
> + * Append the contents of @len bytes of the Modified UTF-8 string
> + * @str into @buf, with escape sequences used for non-printable ASCII
> + * and for quote characters.  The result is therefore safe for output
> + * to a terminal.
> + *
> + * Modified UTF-8 is exactly like UTF-8, except U+0000 is encoded as
> + * "\xC0\x80".
> + */
> +void mod_utf8_sanitize_len(GString *buf, const char *str, ssize_t len)
> +{
> +    const char *ptr;
> +    char *end;
> +    int cp;
> +
> +    if (len < 0) {
> +        len = strlen(str);
> +    }
> +
> +    for (ptr = str; *ptr; ptr = end) {
> +        cp = mod_utf8_codepoint(ptr, MIN(6, str + len - ptr), &end);
> +        switch (cp) {
> +        case '\"':
> +            g_string_append(buf, "\\\"");
> +            break;
> +        case '\\':
> +            g_string_append(buf, "\\\\");
> +            break;
> +        case '\b':
> +            g_string_append(buf, "\\b");
> +            break;
> +        case '\f':
> +            g_string_append(buf, "\\f");
> +            break;
> +        case '\n':
> +            g_string_append(buf, "\\n");
> +            break;
> +        case '\r':
> +            g_string_append(buf, "\\r");
> +            break;
> +        case '\t':
> +            g_string_append(buf, "\\t");
> +            break;
> +        default:
> +            if (cp < 0) {
> +                cp = 0xFFFD; /* replacement character */
> +            }
> +            if (cp > 0xFFFF) {
> +                /* beyond BMP; need a surrogate pair */
> +                g_string_append_printf(buf, "\\u%04X\\u%04X",
> +                                       0xD800 + ((cp - 0x10000) >> 10),
> +                                       0xDC00 + ((cp - 0x10000) & 0x3FF));
> +            } else if (cp < 0x20 || cp >= 0x7F) {
> +                g_string_append_printf(buf, "\\u%04X", cp);
> +            } else {
> +                g_string_append_c(buf, cp);
> +            }
> +        }
> +    }
> +}
> -- 
> 2.45.2

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v

Re: [PATCH 1/2] util: Refactor json-writer's string sanitizer to be public

Reply via email to