On Thu, Aug 08, 2024 at 09:54:26AM GMT, Markus Armbruster wrote:
> Eric Blake <ebl...@redhat.com> writes:
> 
> > My next patch needs to convert text from an untrusted input into an
> > output representation that is suitable for display on a terminal is
> > useful to more than just the json-writer; the text should normally be
> > UTF-8, but blindly allowing all Unicode code points (including ASCII
> > ESC) through to a terminal risks remote-code-execution attacks on some
> > terminals.  Extract the existing body of json-writer's quoted_strinto
> > a new helper routine mod_utf8_sanitize, and generalize it to also work
> > on data that is length-limited rather than NUL-terminated.
> 
> This is two changes in one patch: factor out, and generalize.
> 
> You don't actually use the generalized variant.  Please leave that for
> later, and keep this patch simple.

Makes sense. Will simplify in v2.

> 
> >                                                             [I was
> > actually surprised that glib does not have such a sanitizer already -
> > Google turns up lots of examples of rolling your own string
> > sanitizer.]

See https://gitlab.gnome.org/GNOME/glib/-/issues/3426 where I asked if
glib should provide a more generic sanitizer.  In turn, I found glib
does have:

char*
g_uri_escape_string (
  const char* unescaped,
  const char* reserved_chars_allowed,
  gboolean allow_utf8
)

which is a bit more powerful (you have some fine-tuning on what gets
escaped), but a different escaping mechanism (%XX instead of \uXXXX
escapes, where % rather than \ is special).  I'm not sure whether it
is nicer to have a malloc'd result or to append into a larger g_string
as the base operation (and where you could write an easy wrapper to
provide the other operation).

> > +/**
> > + * mod_utf8_sanitize:
> > + * @buf: Destination buffer
> > + * @str: Modified UTF-8 string to sanitize
> > + *
> > + * Append the contents of the NUL-terminated Modified UTF-8 string
> > + * @str into @buf, with escape sequences used for non-printable ASCII
> 
> "Append into" or "append to"?

"append to" works, will simplify.

> 
> > + * and for quote characters.  The result is therefore safe for output
> > + * to a terminal.
> 
> Actually, we escape double quote, backslash, ASCII control characters,
> and non-ASCII characters, i.e. we leave just printable ASCII characters
> other than '"' and '\\' unescaped.  See the code quoted below.
> 
> Escaping '\\' is of course necessary.
> 
> Escaping '"' is necessary only for callers that want to enclose the
> result in double quotes without ambiguity.  Which is what the existing
> caller wants.  Future callers may prefer not to escape '"'.  But we can
> worry about this when we run into such callers.

If we encounter more users, I could see this morphing into a backend
that takes a flag argument on knobs like whether to escape " or ',
whether to preserve printing Unicode, ...; coupled with frontends with
fewer arguments that pass the right flags intot the backend.

> 
> > + *
> > + * Modified UTF-8 is exactly like UTF-8, except U+0000 is encoded as
> > + * "\xC0\x80".
> > + */
> 
> Missing: behavior for invalid sequences.  Each such sequence is replaced
> by a replacement character U+FFFD.  For the definition of "invalid
> sequence", see mod_utf8_codepoint().

Indeed, more documentation here wouldn't hurt (both on \ and " being
escaped, and on U+FFFD codepoints being placed into the output
stream).

> 
> On the function name.  "Sanitize" could be many things.  This function
> actually does two things: (1) replace invalid sequences, and (2) escape
> to printable ASCII.  What about append_mod_utf8_as_printable_ascii()?
> Admittedly a mouthful.

Naming is always tough.  Your suggestion is longer, but indeed
accurate.  Maybe a shorter compromise of append_escaped_mod_utf8()?

> 
> > +void mod_utf8_sanitize(GString *buf, const char *str)
> > +{
> > +    mod_utf8_sanitize_len(buf, str, -1);
> > +}

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org


Reply via email to